Preface


The primary objective of this book is to prepare students for empirical 
research. But it also serves those who will go on to advanced study in 
econometric theory. Recognizing that readers will have diverse back- 
grounds and interests, I appeal to intuition as well as to rigor, and draw 
on a general acquaintance with empirical economics. I encourage 
readers to develop a critical sense: students ought to evaluate, rather 
than simply accept, what they read in journals and textbooks. 

The book derives from lecture notes that I have used in the first-year 
graduate econometrics course at the University of Wisconsin. Students 
enroll from a variety of departments, including agricultural economics, 
finance, accounting, industrial relations, and sociology, as well as eco- 
nomics. All have had a year of calculus, a semester of linear algebra, 
and a semester of statistical inference. Some have had much more course 
work, including probability theory, mathematical statistics, and econo- 
metrics. Others have had substantial empirical research experience. 

All of the material can be covered—indeed has been covered—in two 
semesters. To make that possible, I focus on a few underlying principles, 
rather than cataloging many potential methods. To accommodate stu- 
dents with varied preparation, the book begins with a review of ele- 
mentary statistical concepts and methods, before proceeding to the 
regression model and its variants. 

Although the models covered are quite standard, the approach taken 
is somewhat distinctive. The conditional expectation function (CEF) is introduced
as the key feature of a multivariate population for economists
who are interested in relations among economic variables. The CEF 
describes how the average value of one variable varies with values of 
the other variables in the population—a very simple concept. Another 
key feature of a multivariate population is the linear projection, or best 
linear predictor (BLP): it provides the best linear approximation to the
CEF. Alternative regression models arise according to the sampling 
scheme used to get drawings from the population. 

The focus on CEF's and BLP's is useful. For example, whether a 
regression specification is "right" or "wrong," least-squares linear regres- 
sion will typically estimate something in the population, namely, the 
BLP. Instead of emphasizing the bias (or inconsistency) of least squares, 
one can consider whether or not the population feature that it does 
consistently estimate is an interesting one. This approach also avoids 
visualizing empirical relations as disturbed versions of exact functions. 
For the most part, "disturbances" are just deviations from a mean, 
rather than objects that must be added to theoretical relations to pro- 
duce empirical relations. 

The analogy principle is relied on to suggest estimators, which are then 
evaluated according to conventional criteria. Thus least-squares, instru- 
mental-variable, and maximum-likelihood estimators are made plausible 
by analogy before their sampling properties are studied. 

A pedagogical feature of the book is the introduction of technical 
ideas in simple settings. Many advanced items are covered in the context 
of simple regression. These include asymptotics, the effect of alternative 
sampling schemes, and heteroskedasticity-corrected standard errors. 
The asymptotic theory for the ratio of sample means in sampling from 
a bivariate population, derived in Chapter 10, serves as a prototype for 
much more elaborate problems. 

From Chapter 16 on, the exercises include real micro-data analyses. 
These are keyed to the GAUSS programming language, but can readily 
be adapted to other languages or packages. Virtually all of the exercises 
have been used as homework assignments or exam questions. 

I thank three cohorts of students at Wisconsin, and one class at 
Stanford (where a portion of the material was used in 1990), for pressing 
me on details as well as on exposition. Over the years, I have had the 
benefit of guidance and instruction by several past and present col- 
leagues at Wisconsin, including Guy Orcutt, Harold Watts, Glen Cain, 
Laurits Christensen, Gary Chamberlain, Charles Manski, and James 
Powell. I am particularly grateful to Gary Chamberlain for his close 
critical reading of an early version of the manuscript. Frank Wolak of 
Stanford provided helpful comments on a later version. They all will 
recognize their ideas here despite my attempts at camouflage. 

I am fortunate to have had the expert editorial advice of Elizabeth
Gretz at Harvard University Press, and the proofreading assistance of 
Donghul Cho and Sangyong Joo. For permission to quote or reproduce 
their work, I thank Thad W. Mirer, John J. Johnston, and Aptech 
Systems, Inc. Passages from Econometric Methods, 3d ed., by John J. 
Johnston, copyright © 1984 by McGraw-Hill, are reproduced with per- 
mission of McGraw-Hill, Inc.; Table 1.1 is adapted with permission of 
the Institute for Social Research from Consumer Behavior of Individual 
Families over Two and Three Years, edited by R. Kosobud and J. N. Morgan 
(Ann Arbor: Institute for Social Research, The University of Michigan, 
1964); Table A.6 is reprinted by permission of John Wiley & Sons, Inc., 
from Principles of Econometrics by Henri Theil, copyright © 1971 by John 
Wiley & Sons; Tables A.3 and A.5 are reprinted by permission of 
Macmillan Publishing Company from Economic Statistics and Econometrics, 
2d ed., by Thad W. Mirer, copyright © 1988 by Macmillan Publishing 
Company. 

Madison, Wisconsin 
November 1990 


A Course in Econometrics 


1 Empirical Relations 


1.1. Theoretical and Empirical Relations 


Most of economics is concerned with relations among variables. For 
example, economists might consider how 


* the output of a firm is related to its inputs of labor, capital, and raw 
materials; 

* the inflation rate is related to unemployment, change in the money 
supply, and change in the wage rate; 

* the quantity demanded of a product depends on household income, 
price of the product, and prices of substitute products; 

* the proportion of income saved varies with the level of family
income;

* the earnings of a worker are related to her age, education, race, 
region of residence, and years of work experience. 


In theoretical economics, the relations are characteristically treated as 
exact relations, that is, as deterministic relations, that is, as (single- 
valued) functions. For example, consider the relation between savings 
and income. Let 


Y = savings rate = savings/income = proportion of income saved,
X = income.


In theoretical economics, one might consider Y = g(X), where g(·) is a
function in the mathematical sense, that is, a single-valued function. 
Henceforth we will always use the word "function" in this strict sense. 
Corresponding to each value of X, there is a unique value of Y. An 
economist might ask such questions as: Is g(X) constant with respect to 
X? Is g(X) increasing in X? Is g(X) linear in X? 

The same applies when there are several explanatory variables
$X_1, \ldots, X_k$, as when a firm's output is related to its inputs of labor, capital,
and raw materials. In theory one considers $Y = g(X_1, \ldots, X_k)$, where
corresponding to each set of values for $X_1, \ldots, X_k$, there is a unique
value of Y. So g is again a (single-valued) function. The relation of Y to
the X's is an exact one, that is, a deterministic one.

This is what relations look like in theory. What happens when we look 
at empirical relations, that is, at real-world data on economic variables? 

Table 1.1 refers to 1027 U.S. "consumer units" (roughly, families) 
interviewed by the University of Michigan's Survey Research Center in 
1960, 1961, and 1962. Income is averaged over the two years 1960 and 
1961; the savings rate is the ratio of two-year savings to two-year income. 
In the source, the data were presented in grouped form, with ten 
brackets for income and nine brackets for the savings rate. For conve- 
nience, we have assumed that all observations in a bracket were located 
at a single point (the approximate midpoint of the bracket) and have 
labeled the values of X and Y accordingly. Across the top of the table 
are the ten distinct values of X = income (in thousands of dollars), which 
we refer to as $x_i$ ($i = 1, \ldots, 10$). Down the left-hand side of the table
are the nine distinct values of Y = savings rate, which we refer to as $y_j$
($j = 1, \ldots, 9$). So there are 90 = 10 × 9 cells in the cross-tabulation.
In the $i, j$ cell one finds

$p(x_i, y_j)$ = the proportion of the 1027 families who reported
the combination ($X = x_i$ and $Y = y_j$).


Table 1.1 Joint frequency distribution of X = income and Y = savings rate.

                                      X
   Y      0.5   1.5   2.5   3.5   4.5   5.5   6.7   8.8  12.5     …
  .50    .001  .011  .007  .006  .005  .005  .008  .009  .014     …
  .40    .001  .002  .006  .007  .010  .007  .008  .009  .008     …
  .25    .002  .006  .004  .007  .010  .011  .020  .019  .013     …
  .15    .002  .009  .009  .012  .016  .020  .042  .054  .024     …
  .05    .010  .023  .033  .031  .041  .029  .047  .039  .042     …
  .00    .013  .013  .000  .002  .001  .000  .000  .000  .000     …
 -.05    .001  .012  .011  .005  .012  .016  .017  .014  .004     …
 -.18    .002  .008  .013  .006  .009  .008  .008  .008  .006     …
 -.25    .009  .009  .010  .006  .009  .007  .005  .003  .002     …

 p(x)    .041  .093  .093  .082  .113  .103  .155  .155  .113     …

Source: Adapted from R. Kosobud and J. N. Morgan, eds., Consumer Behavior of Individual
Families over Two and Three Years (Ann Arbor: Institute for Social Research, The University of
Michigan, 1964), Table 5-5.


This table gives the joint frequency distribution of Y and X for this data 
set. 

Here is some general notation for a joint frequency distribution of 
variables X and Y, where X takes on distinct values $x_i$ ($i = 1, \ldots, I$) and
Y takes on distinct values $y_j$ ($j = 1, \ldots, J$). The joint frequencies $p(x_i, y_j)$
are defined for each of the I × J cells. Clearly $\sum_i \sum_j p(x_i, y_j) = 1$, where
$\sum_i = \sum_{i=1}^{I}$ and $\sum_j = \sum_{j=1}^{J}$. From the joint frequency distribution it is easy to
calculate the marginal frequency distribution of X:

$$p(x_i) = \sum_j p(x_i, y_j) = \text{proportion of observations having } X = x_i \qquad (i = 1, \ldots, I).$$

Then $\sum_i p(x_i) = \sum_i \left[ \sum_j p(x_i, y_j) \right] = 1$.

Return to the joint frequency distribution of Table 1.1. For each of 
the ten columns, add down the rows to get the marginal frequency 
distribution p(x) in the last row—the bottom margin—and observe that 
the entries in the last row do add up to 1. 

Evidently, the empirical relation between Y and X is not a deterministic 
one. For if it were, then in any column of the body of the table, there 
would be only a single nonzero entry. But in every column, there are 
several nonzero entries. Indeed, in most columns, all nine entries are 
nonzero. Corresponding to each value of X, there is a whole set of 
values of Y rather than a single value of Y. What we see is a distribution
rather than a function. This is characteristic of the real world: empirical
relations are not deterministic, not exact, not functional relations.

Now focus attention on the distribution of Y corresponding to a 
particular value of X. Take $X = x_i$, say, and ask what proportion of the
observations that have $X = x_i$ also have the values $Y = y_1, \ldots, y_J$. The
answers give the conditional frequency distribution of Y given $X = x_i$:

$$p(y_j \mid x_i) = \frac{p(x_i, y_j)}{p(x_i)} \qquad (j = 1, \ldots, J).$$

It follows for each $i = 1, \ldots, I$ that

$$\sum_j p(y_j \mid x_i) = \frac{\sum_j p(x_i, y_j)}{p(x_i)} = \frac{p(x_i)}{p(x_i)} = 1.$$


Divide the entries in each column of Table 1.1 by the column sum. 
The resulting Table 1.2 gives the conditional frequency distributions of 
Y given X, one such distribution for each distinct value of X. Observe 
that each column sum in this table is equal to 1. The nondeterministic 
character of empirical relations is again apparent. If Y = g(X) as in
theoretical economics, each column in the body of Table 1.2 would have 
a single unity, all other entries being zero. But Table 1.2 does not look 
like that. 
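
As a quick check on this arithmetic, the short Python sketch below carries out the division for a single column. The nine numbers are the entries of the first column (X = 0.5) of Table 1.1, read down the rows; the variable names are illustrative and not part of the text.

```python
# Entries p(x_1, y_j) in the first column (X = 0.5) of Table 1.1,
# read down the nine rows of the table.
joint_col = [0.001, 0.001, 0.002, 0.002, 0.010, 0.013, 0.001, 0.002, 0.009]

# Marginal frequency p(x_1): add down the rows of the column.
p_x = sum(joint_col)

# Conditional frequencies p(y_j | x_1) = p(x_1, y_j) / p(x_1).
cond_col = [p / p_x for p in joint_col]

print("p(x_1) =", round(p_x, 3))
print("p(y_j | x_1):", [round(c, 3) for c in cond_col])
print("column sum:", round(sum(cond_col), 6))
```

Because the column is divided by its own sum, the conditional frequencies necessarily add to 1, whatever the original entries were.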

So we face a dilemma. We would like to use economic theory to guide 
our analysis of data, and to use data to implement the theory. But the 
savings-income relation in economic theory is deterministic, while in the 
empirical data it is not deterministic. How shall we resolve the dilemma? 

The theory seems to say that all families with the same value of X 
should have the same Y. If so, the data seem to indicate that these 
families did not do what they should. Perhaps they tried to, but made 
mistakes? If so, the conditional distributions are all due to error: there
is a true value of Y for each value of X, but the families erred in their
savings behavior, or perhaps in reporting savings to the interviewer.
This is surely possible, but to rely on errors alone is unappealing.


Table 1.2 Conditional frequency distributions of Y = savings rate for given values of
X = income.

                                      X
   Y      0.5   1.5   2.5   3.5   4.5   5.5   6.7   8.8  12.5     …
  .50    .024  .118  .075  .073  .044  .049  .052  .058  .124     …
  .40    .024  .022  .064  .086  .088  .068  .052  .058  .071     …
  .25    .049  .064  .043  .086  .088  .107  .129  .123  .115     …
  .15    .049  .097  .097  .146  .142  .194  .271  .348  .212     …
  .05    .244  .247  .355  .378  .363  .281  .303  .252  .372     …
  .00    .317  .140  .000  .024  .009  .000  .000  .000  .000     …
 -.05    .024  .129  .118  .061  .106  .155  .109  .090  .035     …
 -.18    .049  .086  .140  .073  .080  .078  .052  .052  .053     …
 -.25    .220  .097  .108  .073  .080  .068  .032  .019  .018     …

 Total  1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000

 m_Y|x  -.012  .065  .048  .099  .079  .083  .112  .129  .154     …

We know that the families differ in characteristics other than income 
that may be relevant to their savings behavior. The gap between theory 
and reality might diminish if the theory introduced more explanatory 
variables, $X_2, \ldots, X_k$. Then instead of looking at $p(y_j \mid x_i)$ we would be
looking at $p(y \mid x_1, \ldots, x_k)$. Presumably there will be less dispersion of
Y within those narrowly defined cells than there is in the coarsely defined 
cells of our tables. But even then the empirical relation would not be 
deterministic. For example, consider all households who have the same 
income, family size, and race. We would still see differences in their Y 
values. Because a gap would remain in any case, for present purposes 
we may as well continue with the single-X case. 


1.2. Sample Means and Population Means 


To resolve the dilemma, we first reinterpret the economic theory. When 
the theorist speaks of Y being a function of X, let us say that she means 
that the average value of Y is a function of X. If so, when she says that 
g(X) increases with X, she means that on average, the value of Y increases 
with X. Or, when she says that g(X) is constant, she means that the 
average value of Y is the same for all values of X. With that interpretation 
in mind, let us re-examine our data set, seeking the empirical counter- 
part of the theorist's average value. 

Here is some algebra that shows how to calculate the average of a 
frequency distribution. First, for the variable X = income: if the marginal
frequency distribution of X is given by $p(x_i)$ ($i = 1, \ldots, I$), then
the marginal mean of X is

$$m_X = \sum_i x_i \, p(x_i).$$

Similarly, the marginal mean of Y is $m_Y = \sum_j y_j \, p(y_j)$. Further, if the
conditional frequency distribution of Y given $X = x_i$ is $p(y_j \mid x_i)$
($j = 1, \ldots, J$), then the conditional mean of Y given $X = x_i$ is

$$m_{Y|x_i} = \sum_j y_j \, p(y_j \mid x_i).$$

There are I such conditional means, one for each distinct value of X.
Observe that the average of the conditional means equals the marginal
mean:

$$\sum_i m_{Y|x_i} \, p(x_i) = \sum_i \Big[ \sum_j y_j \, p(y_j \mid x_i) \Big] p(x_i) = \sum_i \sum_j y_j \, p(x_i, y_j) = \sum_j y_j \Big[ \sum_i p(x_i, y_j) \Big] = \sum_j y_j \, p(y_j) = m_Y.$$


Return to Table 1.2. The conditional means of Y have been calculated,
one for each of the ten values of X, and are presented in the last row
of the table. If we extract the top row (the $x_i$) and the bottom row (the
$m_{Y|x_i}$), we have the conditional mean function, or cmf, for Y given X, which
we will refer to as $m_{Y|X}$.
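
The calculation of a cmf, and the fact that the conditional means weighted by the marginal frequencies average out to the marginal mean, can be sketched in a few lines of Python. The small 2 × 3 joint frequency table below is hypothetical, chosen only to keep the arithmetic transparent; it is not taken from Table 1.1.

```python
# Hypothetical joint frequency table (illustrative values, not survey data).
x_values = [1.0, 2.0]
y_values = [0.0, 0.1, 0.2]
# joint[i][j] = p(x_i, y_j); the six entries add up to 1.
joint = [
    [0.10, 0.20, 0.10],  # row for x = 1
    [0.05, 0.25, 0.30],  # row for x = 2
]

# Marginal frequencies p(x_i) and p(y_j).
p_x = [sum(row) for row in joint]
p_y = [sum(joint[i][j] for i in range(len(x_values)))
       for j in range(len(y_values))]

# Conditional mean function: m_{Y|x_i} = sum_j y_j p(x_i, y_j) / p(x_i).
cmf = [sum(y * p for y, p in zip(y_values, row)) / px
       for row, px in zip(joint, p_x)]

# Marginal mean of Y, and the p(x_i)-weighted average of the conditional means.
m_y = sum(y * p for y, p in zip(y_values, p_y))
weighted_avg = sum(m * px for m, px in zip(cmf, p_x))

for x, m in zip(x_values, cmf):
    print(f"x = {x}: conditional mean of Y = {m:.4f}")
print("marginal mean of Y:", round(m_y, 4))
print("weighted average of conditional means:", round(weighted_avg, 4))
```

The last two printed numbers agree, as the algebra above says they must.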

The cmf is a deterministic relation—that is, a function—in our data. 
The cmf specifies how the average value of Y is functionally related to 
X in the data set. For an economist who is concerned with the relation 
of the savings rate to income, this cmf $m_{Y|X}$ is the most interesting
feature of the joint frequency distribution. We can plot it, and study it
in terms of the economic theorist's concerns: Does $m_{Y|X}$ vary with X?
Does it vary linearly with X, that is, is $\Delta m / \Delta X$ constant?

In Figure 1.1, the ten points that make up the cmf are plotted and, 
for convenience, are connected by line segments. Looking at the plot, 
we see a cmf that is too ragged and erratic to be taken seriously by a 
theorist. So a gap remains between theory and reality. 

To proceed, we recognize that the theorist who discussed the relation 
between the savings rate and income was not talking about $m_{Y|X}$ for
these particular 1027 families in 1960–1961. If the Survey Research
Center had happened to interview a different 1027 families, or even a
1028th family, or even the same 1027 families in a different year, we
would have had a different p(x, y) table, different p(y|x) columns, and
no doubt a different $m_{Y|X}$ function.

Figure 1.1 Conditional mean function: savings rate on income. (Vertical axis: savings rate; horizontal axis: income, in thousands of dollars.)

The next step is obvious. We suppose that what we observe is only a
sample from a population. Our cmf displays sample means, not population
means. Presumably the theorist was referring to population means, not
sample means. It will be adequate for present purposes to think of the
population itself as represented by a joint frequency distribution, one
that refers to millions of families rather than to our 1027. (For convenience,
we continue to suppose that the X, Y pairs are confined to the
same 90 combinations.) In the population the joint frequencies are given
by $\pi(x_i, y_j)$, say, with $\sum_i \sum_j \pi(x_i, y_j) = 1$. So in the population, the marginal
frequencies of X are

$$\pi(x_i) = \sum_j \pi(x_i, y_j),$$

and the conditional frequencies of Y given X are

$$\pi(y_j \mid x_i) = \pi(x_i, y_j) / \pi(x_i).$$

Further, the population mean of X is

$$\mu_X = \sum_i x_i \, \pi(x_i),$$

and the population conditional means of Y given X are

$$\mu_{Y|x_i} = \sum_j y_j \, \pi(y_j \mid x_i).$$
j 

We have arrived at the following position. When a theorist talks about
Y = g(X), she is really referring to the population conditional mean
function $\mu_{Y|X} = g(X)$, which is indeed a function of X. We now have
the theorist referring to the population features $\pi(x, y)$, $\pi(y \mid x)$, $\mu_{Y|X}$,
while the empirical material refers to the sample features $p(x, y)$, $p(y \mid x)$,
$m_{Y|X}$. This leaves us with the gap between the hypothetical population
$\pi$'s and $\mu$'s and the observed sample $p$'s and $m$'s.


1.3. Sampling 


Imagine this physical representation of the population and sample. Each 
family in the population is represented by a chip on which its (X, Y) 
pair is printed. The millions of chips are in a barrel. The joint frequency 
distribution in the barrel is $\pi(x, y)$. We draw 1027 chips with replacement,
record the x, y combinations, and tabulate the (relative) frequencies
as p(x, y). Our p(x, y) table is just one of the possible p(x, y) tables
that might have been obtained in this manner. Our data set is just one
sample from the population. In general none of the possible sample
p(x, y) tables will be identical to the population $\pi(x, y)$ table, and none
of the possible sample $m_{Y|X}$ functions will coincide with the population
$\mu_{Y|X}$ function.
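
The barrel-of-chips experiment is easy to imitate on a computer. The Python sketch below is only an illustration: the population frequencies are hypothetical (six cells rather than 90), the function names are made up for this example, and the seed is fixed simply so that the run can be repeated.

```python
import random
from collections import Counter

# Hypothetical population joint frequencies pi(x, y); illustrative values only.
population = {
    (1, 0.0): 0.10, (1, 0.1): 0.20, (1, 0.2): 0.10,
    (2, 0.0): 0.05, (2, 0.1): 0.25, (2, 0.2): 0.30,
}

def draw_sample_frequencies(pi, n, rng):
    """Draw n chips with replacement; return the sample frequencies p(x, y)."""
    cells = list(pi)
    weights = [pi[c] for c in cells]
    counts = Counter(rng.choices(cells, weights=weights, k=n))
    return {c: counts[c] / n for c in cells}

def conditional_means(freq):
    """Return {x: conditional mean of Y given X = x} for a frequency table.

    Assumes every x value has positive frequency, which is practically
    certain for the sample size used here.
    """
    means = {}
    for x in sorted({cell[0] for cell in freq}):
        p_x = sum(p for (xi, _), p in freq.items() if xi == x)
        means[x] = sum(y * p for (xi, y), p in freq.items() if xi == x) / p_x
    return means

rng = random.Random(0)
mu = conditional_means(population)          # population cmf
print("population:", {x: round(v, 4) for x, v in mu.items()})
for s in range(3):
    p = draw_sample_frequencies(population, 1027, rng)
    m = conditional_means(p)                # sample cmf for this drawing
    print(f"sample {s}: ", {x: round(v, 4) for x, v in m.items()})
```

Each pass through the loop plays the role of a different set of 1027 interviews: the sample cmf changes from one drawing to the next, and none of them need coincide with the population cmf.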

The dilemma has been substantially resolved. The questions that 
remain include: What sort of samples come from a population? How 
do sample joint frequency distributions, conditional frequency distri- 
butions, and cmf's depart from the population joint frequency distri- 
bution, conditional frequency distribution, and cmf? How can we best 
use a sample to learn about the population from which it came? How 
confident can we be in our conclusions? These are precisely the questions
that are addressed in classical statistical theory.


1.4. Estimation 


A large part of empirical econometrics is concerned with estimating 
population conditional mean functions from a sample. That is, econo- 
mists very often want to learn how the average value of one variable 
varies in a population with one, or several, other variables. 

If so, what remains to be discussed? After all, in introductory statistics 
courses, we have learned all about estimating population means. In 
particular we have learned that the sample mean is an attractive estimator
of a population mean, perhaps even that it is the best estimator.
That attractiveness should carry over to the present situation, where we 
are concerned with a population conditional mean function. A popula- 
tion cmf is just a set of population means, and our joint sample can be 
viewed as a collection of conditional subsamples. So it is natural to use 
the sample conditional means as estimates of the population conditional 
means. That is, it is natural to take $m_{Y|x_i}$ as the estimate of $\mu_{Y|x_i}$, thus
taking $m_{Y|X}$ as the estimate of $\mu_{Y|X}$, bearing in mind that $m \neq \mu$.

But is that always the right way to proceed? Suppose that an economic 
theory says that the population cmf for the savings rate on income is
linear: $\mu_{Y|X} = \alpha + \beta X$, with $\alpha$ and $\beta$ unknown. As Figure 1.1 shows,
our sample $m_{Y|X}$ is not linear in income: the ten $m_{Y|x_i}$'s do not fall on
a straight line. As empirical economists who wish to be guided by economic
theory, shall we retain the ten sample $m_{Y|x_i}$'s as they stand? Or
shall we smooth the sample $m_{Y|x_i}$'s by fitting a straight line to the ten
points, obtaining $m^*_{Y|X} = a + bX$, and use those $m^*_{Y|x_i} = a + b x_i$
($i = 1, \ldots, 10$) as the estimates of the $\mu_{Y|x_i}$, thus using a and b as the estimates
of $\alpha$ and $\beta$? If we decide to smooth, how shall we fit the line? And do
we know that the smoothed estimates are better than the sample means
as estimates of the population means? Or suppose a theory said that the
population cmf is exponential: $\mu_{Y|X} = \alpha X^{\beta}$. How should we fit that
curve? And does the smoothed sample line $m^*_{Y|X}$ still tell us anything
about the population curve $\mu_{Y|X}$?
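
To make the idea of smoothing concrete, here is one possible way of fitting a line, sketched in Python. It is an illustration under stated assumptions, not a recommended procedure: it uses ordinary (unweighted) least squares, it ignores the fact that the columns have different frequencies $p(x_i)$, and it uses only the nine conditional means that appear in the last row of Table 1.2. Whether this is a good way to fit the line is exactly the kind of question raised above.

```python
# The nine (x_i, m_{Y|x_i}) pairs shown in Table 1.2.
x = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.7, 8.8, 12.5]
m = [-0.012, 0.065, 0.048, 0.099, 0.079, 0.083, 0.112, 0.129, 0.154]

n = len(x)
x_bar = sum(x) / n
m_bar = sum(m) / n

# Unweighted least-squares slope and intercept (an assumed fitting rule).
b = (sum((xi - x_bar) * (mi - m_bar) for xi, mi in zip(x, m))
     / sum((xi - x_bar) ** 2 for xi in x))
a = m_bar - b * x_bar

# Smoothed values a + b*x_i and the deviations of the sample means from them.
fitted = [a + b * xi for xi in x]
residuals = [mi - fi for mi, fi in zip(m, fitted)]

print(f"intercept a = {a:.4f}, slope b = {b:.4f}")
for xi, mi, fi in zip(x, m, fitted):
    print(f"x = {xi:5.1f}   sample mean = {mi:6.3f}   smoothed = {fi:6.3f}")
```

A weighted fit, with weights proportional to the $p(x_i)$, would be an equally natural candidate and would generally give a different line.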

These are typical of the issues that arise in this book. To address 
them seriously, we turn to a review of the framework provided by the 
random variable—probability distribution model of classical statistics. 


Exercises 


The following all refer to the empirical joint frequency distribution of 
Tables 1.1 and 1.2. 


1.1 Calculate the marginal frequency distribution of Y. Then calculate 
the mean of the conditional means of Y, verifying that it equals the 
marginal mean of Y (up to round-off error). 


1.2 Calculate the conditional frequency distributions of X given Y, 
and the conditional mean function of X given Y. 


1.3 Plot the two conditional mean functions $m_{Y|X}$ and $m_{X|Y}$ on a single
diagram, using the horizontal axis for x-values and the vertical axis for
y-values. Comment on the differences between those two functions.


1.4 Let Z = savings (in thousands of dollars), so Z = XY. The savings
of a family with income $x_i$ and savings rate $y_j$ is $z_{ij} = x_i y_j$, so that the
mean savings for the families in our sample is given by

$$m_Z = \sum_i \sum_j x_i y_j \, p(x_i, y_j).$$

Will this equal $m_X m_Y$? That is, can mean savings be obtained by multiplying
mean savings rate by mean income? Explain.


2 Univariate Probability Distributions 


2.1. Introduction 


The general framework of probability theory involves an experiment 
that has various possible outcomes. Each distinct outcome is represented 
as a point in a set, the sample space. Probabilities are assigned to certain 
outcomes in accordance with certain axioms, and then the probabilities 
of other events, which are subsets of the sample space, are deduced. 
Let S denote the sample space, A denote an event, and Pr(·) the probability
assignment. Then the axioms are $0 \le \Pr(A) \le 1$, $\Pr(S) = 1$, and,
where $A_1, A_2, \ldots$ are disjoint events, $\Pr(\cup_j A_j) = \sum_j \Pr(A_j)$.

We proceed somewhat more concretely. The distinct possible out- 
comes are identified, that is, distinguished, by the value of a single 
variable X. Each trial of the experiment produces one and only one 
value of X. Here X is called a random variable, a label that merely indicates
that X is a variable whose value is determined by the outcome of an
experiment. The values that X takes on are denoted by x. So we may
refer to events such as {X = x} and {X ≤ x}. We distinguish two cases
of probability distributions: discrete and continuous.


2.2. Discrete Case 


In the discrete case, the number of distinct possible outcomes is either 
finite or countably infinite, so one can compile a list of them:
$x_1, x_2, \ldots$. The convention is to list these mass points in increasing order:
$x_1 < x_2 < \cdots$. The assignment of probabilities is done via a function
f(x), with these properties:

f(x) ≥ 0 everywhere,
f(x) = 0 except at the mass points $x_1, x_2, \ldots$,
$\sum_i f(x_i) = 1$,

where $\sum_i$ denotes summation over all the mass points. The function f(·)
is called a probability mass function, or pmf.

The initial assignment of probabilities is Pr(X = x) = f(x). That is, the
probability that the random variable capital X takes on the value lowercase
x is f(x). Then the probabilities of various events are deducible
by rules of probability theory. For example, supposing that the list is in
increasing order, $\Pr(X \le x_k) = \sum_{i=1}^{k} f(x_i)$. For another example, if $x_0$ is
not a mass point, then $\Pr(X = x_0) = f(x_0) = 0$.

Observe that the pmf f(·) has exactly the formal properties that p(·)
had in univariate frequency distributions. (And observe the perverse
notation: we used p(·) for frequency distributions, and now use f(·) for
probability distributions.) Because of the formal resemblance, it may be
helpful to interpret the pmf f(·) as the π(·) of Chapter 1, namely the
frequency distribution in a population.

Here are several examples of discrete univariate probability distri- 
butions: 

(1) Bernoulli with parameter p (0 ≤ p ≤ 1). Here

$$f(x) = p^x (1 - p)^{1 - x} \quad \text{for } x = 0, 1,$$

with f(x) = 0 elsewhere. So $\Pr(X = 0) = f(0) = p^0 (1 - p)^1 = 1 - p$,
$\Pr(X = 1) = f(1) = p^1 (1 - p)^0 = p$, and Pr(X = x) = f(x) = 0 for all other
values of x. Observe that f(x) ≥ 0 everywhere, that f(x) = 0 except at
the two mass points x = 0 and x = 1, and that $\sum_i f(x_i) = f(0) + f(1) = 1$,
as required. So this is a legitimate pmf.

In what contexts might the Bernoulli distribution be relevant? That 
is, for what experiments might it be appropriate? A familiar example 
is a coin toss: X = 1 if heads, X = 0 if tails. The Bernoulli pmf says that 
Pr(X = 1) = p, Pr(X = 0) = 1 - p. Special cases are p = 0.5 (fair coin),
and p = 0.7 (loaded coin). A more interesting example concerns unem- 
ployment. Let X = 1 if unemployed, X = 0 otherwise, the experiment 
being drawing an adult at random from the U.S. population. The 
Bernoulli pmf says that the probability of being unemployed is p. 

(2) Discrete Uniform with parameter N (N positive integer). Here

$$f(x) = 1/N \quad \text{for } x = 1, 2, \ldots, N,$$

with f(x) = 0 elsewhere. Observe that f(x) ≥ 0 everywhere, that f(x) = 0
except at the N mass points, and that $\sum_i f(x_i) = 1/N + \cdots + 1/N = 1$.
So this is a legitimate pmf.

In what contexts might a discrete uniform distribution be relevant? 
A very familiar example is the roll of a fair die: X = the number on the 
face that comes up, and N = 6. This discrete uniform pmf says that 
Pr(X = 1) = Pr(X = 2) =... = Pr(X = 6) = 1/6. 

(3) Binomial with parameters n, p (n positive integer, 0 ≤ p ≤ 1). Here

$$f(x) = \frac{n!}{x!\,(n - x)!} \, p^x (1 - p)^{n - x} \quad \text{for } x = 0, 1, 2, \ldots, n,$$

with f(x) = 0 elsewhere. (Recall factorial notation: 0! = 1, 1! = 1,
2! = 2, 3! = 6, ....) Observe that f(x) ≥ 0 everywhere, that f(x) = 0 except
at the mass points, and (as can be confirmed by summing from 0 to n)
that $\sum_i f(x_i) = [p + (1 - p)]^n = 1$.

In what contexts might the binomial distribution be relevant? Suppose 
we toss n identical coins at once, and let X = number of heads. That is, 
we run the Bernoulli(p) experiment n times, independently, and record 
the number of 1’s. Or if we observe an adult over n months, let X = 
number of months unemployed. The binomial distribution may be 
appropriate. 

Special cases of the binomial include: 


(a) n = 1: $f(0) = 1 \times p^0 (1 - p)^1 = (1 - p)$,
           $f(1) = 1 \times p^1 (1 - p)^0 = p$.


So the binomial distribution with parameters (1, p) is the same as the
Bernoulli distribution with parameter p.


(b) n = 2: $f(0) = 1 \times p^0 (1 - p)^2 = (1 - p)^2$,
           $f(1) = 2 \times p^1 (1 - p)^1 = 2p(1 - p)$,
           $f(2) = 1 \times p^2 (1 - p)^0 = p^2$.


Clearly $f(0) + f(1) + f(2) = [p + (1 - p)]^2 = 1$.
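
The two special cases can be checked numerically. The Python sketch below evaluates the binomial pmf from its formula; the function name `binomial_pmf` and the choice p = 0.3 are illustrative assumptions.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Binomial pmf: n!/(x!(n-x)!) p^x (1-p)^(n-x) at x = 0..n, else 0."""
    if not (isinstance(x, int) and 0 <= x <= n):
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)

p = 0.3  # arbitrary illustrative value

# Case (a), n = 1: the pmf reduces to the Bernoulli pmf, 1 - p and p.
bernoulli_case = [binomial_pmf(x, 1, p) for x in (0, 1)]

# Case (b), n = 2: the pmf is (1-p)^2, 2p(1-p), p^2.
n2_case = [binomial_pmf(x, 2, p) for x in (0, 1, 2)]

# For any n the probabilities at the mass points add up to 1.
total_n10 = sum(binomial_pmf(x, 10, p) for x in range(11))

print("n = 1:", bernoulli_case)
print("n = 2:", n2_case, "sum =", sum(n2_case))
print("n = 10: sum of pmf =", total_n10)
```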

(4) Poisson with parameter λ (λ > 0). Here

$$f(x) = \frac{e^{-\lambda} \lambda^x}{x!} \quad \text{for } x = 0, 1, 2, \ldots,$$

with f(x) = 0 elsewhere. Observe that f(x) ≥ 0 everywhere, that f(x) = 0
except at the mass points, and, using the series expansion

$$e^{\lambda} = \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = 1 + \lambda + \frac{\lambda^2}{2} + \frac{\lambda^3}{6} + \cdots,$$

that $\sum_i f(x_i) = 1$. In the Poisson distribution, the number of distinct
possible outcomes is countably infinite.

Applications of the Poisson distribution might include the number of 
phone calls received at a switchboard in an hour, or the number of job 
offers an individual receives in a year. 
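
Because the Poisson mass points are countably infinite, the sum of the pmf cannot be computed term by term to the end, but partial sums show how quickly it approaches 1. The following Python sketch is illustrative only; the function name `poisson_pmf`, the value λ = 2, and the cutoff at x = 30 are assumptions made for the example.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Poisson pmf: exp(-lam) * lam^x / x! at x = 0, 1, 2, ..., else 0."""
    if not (isinstance(x, int) and x >= 0):
        return 0.0
    return exp(-lam) * lam**x / factorial(x)

lam = 2.0  # illustrative parameter value

# Accumulate the pmf over the first 31 mass points and watch the sum approach 1.
partial = 0.0
for x in range(31):
    partial += poisson_pmf(x, lam)
    if x in (0, 2, 5, 10, 30):
        print(f"sum of f(0..{x}) = {partial:.12f}")
```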


2.3. Continuous Case 


In the continuous case, we again consider an experiment whose outcomes 
are distinguished by the value of a single real variable X. But now there 
is a continuum of distinct possible outcomes, so we cannot compile them 
in a list.

The assignment of probabilities is done via a function f(x) with these
properties: f(x) ≥ 0 everywhere, $\int_{-\infty}^{\infty} f(x)\,dx = 1$. This function f(·) is
called a probability density function, or pdf. The initial assignment of
probabilities via f(x) is as follows. For any pair of numbers a, b with
a ≤ b:

$$\Pr(a \le X \le b) = \int_a^b f(x)\,dx.$$


That is, the probability that the random variable X lies in the closed 
interval [a, b] is given by the area under the f(x) curve between the 
points a and b. 

To see what we are committed to in the continuous case, consider 
several specific events: 


(1) A = {−∞ ≤ X ≤ ∞}.

Here a = −∞, b = ∞, so $\Pr(A) = \int_{-\infty}^{\infty} f(x)\,dx = 1$, as it should, since A
exhausts the sample space.

(2) A = {X ≤ b} = {−∞ ≤ X ≤ b}.

Here a = −∞, so $\Pr(A) = \int_{-\infty}^{b} f(x)\,dx$.

(3) A = {X = a} = {a ≤ X ≤ a}.

Here b = a, so $\Pr(A) = \int_a^a f(x)\,dx = 0$.

Consider (3), which says that in the continuous case Pr(X = x) = 0
for every x. This means that the probability that X takes on a particular 
value x is zero, for every such particular value. And yet on every run 
of the experiment some value of x is taken on. Is that a contradiction? 
No, not unless one confuses two distinct concepts, zero probability and 
impossibility. In the continuous case, a zero-probability event is not an 
impossible event. Although this seems awkward, no other assignment 
of probabilities to events of the form {X = x} is possible when the distinct
possible outcomes form a continuum. 

Further, the following events all have the same probability, namely
$\int_a^b f(x)\,dx$:

$A_1 = \{a \le X \le b\}$, $A_2 = \{a \le X < b\}$,
$A_3 = \{a < X \le b\}$, $A_4 = \{a < X < b\}$.

For example, $A_1 = A_2 \cup A_0$, where $A_0 = \{X = b\}$. But $A_2$ and $A_0$ are
disjoint, and $\Pr(A_0) = 0$, so $\Pr(A_1) = \Pr(A_2)$.

The cumulative distribution function, or cdf, is defined as

$$F(x) = \int_{-\infty}^{x} f(t)\,dt,$$

with t being a dummy argument. The cdf gives the area under the pdf
from −∞ up to x, so F(x) = Pr(X ≤ x). Some properties of a cdf are
immediate:

* F(−∞) = 0, F(∞) = 1.
* F(·) is monotonically nondecreasing (because f(t) ≥ 0).
* Wherever differentiable, dF(x)/dx = f(x), because $F(x) = \int_{-\infty}^{x} f(t)\,dt$, and
the derivative of an integral with respect to its upper limit is just the
argument (the integrand) evaluated at the upper limit.


In the continuous case the cdf is convenient because 


$$\Pr(a \le X \le b) = \int_a^b f(x)\,dx = \int_{-\infty}^{b} f(x)\,dx - \int_{-\infty}^{a} f(x)\,dx = F(b) - F(a).$$


The cdf could have been introduced in the discrete case as $F(x) = \Pr(X \le x)$,
but it is not so crucial there.
Here are several examples of continuous univariate probability
distributions:

(1) Rectangular (or continuous uniform) on the interval [a, b], with
parameters a < b. The pdf is


$$f(x) = 1/(b - a) \quad \text{for } a \le x \le b,$$


with f(x) = 0 elsewhere. Observe that $f(x) \ge 0$ everywhere, and that

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{a} 0\,dx + \int_a^b [1/(b - a)]\,dx + \int_b^{\infty} 0\,dx = [1/(b - a)]\,x\,\Big|_a^b = 1.$$


(Note: The symbol $|$ is used to denote an integral to be evaluated.) So
this is a legitimate pdf. It plots as a rectangle, with base b - a and height
1/(b - a); the area of the rectangle is base × height = 1. The cdf is


$$F(x) = \int_{-\infty}^{x} f(t)\,dt = \begin{cases} 0 & \text{for } x < a, \\ (x - a)/(b - a) & \text{for } a \le x \le b, \\ 1 & \text{for } b < x. \end{cases}$$


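(2) Exponential with parameter $\lambda > 0$. The pdf is

$$f(x) = \lambda e^{-\lambda x} \quad \text{for } x > 0,$$


with f(x) = 0 for $x \le 0$. The relevant indefinite integral is

$$\int \lambda e^{-\lambda t}\,dt = \lambda \int e^{-\lambda t}\,dt = \lambda\, e^{-\lambda t}/(-\lambda) = -e^{-\lambda t},$$


so the cdf is


$$F(x) = \begin{cases} 0 & \text{for } x \le 0, \\ 1 - e^{-\lambda x} & \text{for } x > 0. \end{cases}$$


The exponential pdf and cdf for $\lambda = 2$ are plotted in Figure 2.1.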

The exponential distribution may be appropriate for the length of 
time until a light bulb fails. It may also be relevant for the duration of 
unemployment among those who leave a job, with time being measured 
continuously. 
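
As an illustrative check (not part of the original text), the relation Pr(a ≤ X ≤ b) = F(b) - F(a) can be verified numerically for the exponential distribution with λ = 2. The helper names `exp_pdf`, `exp_cdf`, and `midpoint_integral`, the interval [0.5, 1.5], and the choice of a simple midpoint rule are all assumptions made for this sketch.

```python
import math

def exp_pdf(x, lam):
    """Exponential pdf: lam * exp(-lam * x) for x > 0, zero otherwise."""
    return lam * math.exp(-lam * x) if x > 0 else 0.0

def exp_cdf(x, lam):
    """Exponential cdf: 1 - exp(-lam * x) for x > 0, zero otherwise."""
    return 1.0 - math.exp(-lam * x) if x > 0 else 0.0

def midpoint_integral(f, a, b, steps=20_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

lam = 2.0
a, b = 0.5, 1.5

# Area under the pdf between a and b, computed two ways.
area = midpoint_integral(lambda x: exp_pdf(x, lam), a, b)
by_cdf = exp_cdf(b, lam) - exp_cdf(a, lam)

print(f"numerical area = {area:.6f}, F(b) - F(a) = {by_cdf:.6f}")
```

The two numbers should agree up to the discretization error of the midpoint rule.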

Figure 2.1 Exponential distribution: pdf and cdf, λ = 2.

(3) Standard Normal. The standard normal distribution plays a central
role in statistical theory. The pdf is


$$f(x) = (2\pi)^{-1/2} \exp(-x^2/2),$$


which plots as a familiar bell-shaped curve, as in Figure 2.2. Some 
features of the curve are apparent by inspection of the pdf formula. 
The curve is symmetric about zero: f(-x) = f(x). The ordinate at zero
is $f(0) = 1/\sqrt{2\pi} = 0.3989$. The slope is


$$f'(x) = (2\pi)^{-1/2} \exp(-x^2/2)(-x) = -x f(x).$$


So f'(x) > 0 for x < 0, f'(x) = 0 for x = 0, f'(x) < 0 for x > 0. The
second derivative is $f''(x) = -[x f'(x) + f(x)] = -[-x^2 f(x) + f(x)] =
(x^2 - 1) f(x)$. So the curve has inflection points at x = 1 and x = -1.
The cdf is


$$F(x) = \Pr(X \le x) = \int_{-\infty}^{x} f(t)\,dt.$$


No closed form is available, but the standard normal cdf is plotted in 
Figure 2.2 and tabulated in Table A.1. The tabulation is confined to 
x > 0, which suffices because the symmetry of f(x) about 0 implies that 
F(—x) = 1 — F(x). 
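
Although no closed form exists, the standard normal cdf can be evaluated numerically. The sketch below is not from the original text: it relies on the standard identity F(x) = [1 + erf(x/√2)]/2 and Python's `math.erf`, and the function names are assumptions chosen for illustration. It checks the ordinate at zero, the symmetry relation F(-x) = 1 - F(x), and one cdf value.

```python
import math

def std_normal_pdf(x):
    """Standard normal pdf: (2*pi)^(-1/2) * exp(-x^2 / 2)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    """Standard normal cdf via the error function (a standard identity)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Ordinate at zero and the symmetry of the cdf.
print(f"f(0)  = {std_normal_pdf(0.0):.4f}")
print(f"F(2)  = {std_normal_cdf(2.0):.3f}")
print(f"F(-2) = {std_normal_cdf(-2.0):.3f}, 1 - F(2) = {1.0 - std_normal_cdf(2.0):.3f}")
```

Because of the symmetry, evaluating the cdf for nonnegative arguments is enough to recover it everywhere, which is why a table restricted to that range suffices.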
Figure 2.2 Standard normal distribution: pdf and cdf.

(4) Standard Logistic. The pdf is


$$f(x) = e^x/(1 + e^x)^2.$$


It is easy to verify that this pdf plots as a symmetric bell-shaped curve,
very similar to the standard normal, and that the cdf is


$$F(x) = \int_{-\infty}^{x} f(t)\,dt = e^x/(1 + e^x).$$


(5) Power on interval [0, 1] with parameter $\theta > 0$. The pdf is

$$f(x) = \theta x^{\theta - 1} \quad \text{for } 0 \le x \le 1,$$


with f(x) = 0 elsewhere. The relevant indefinite integral is


$$\int \theta t^{\theta - 1}\,dt = \theta \int t^{\theta - 1}\,dt = (\theta/\theta)\,t^{\theta} = t^{\theta},$$

so the cdf is


$$F(x) = \begin{cases} 0 & \text{for } x < 0, \\ x^{\theta} & \text{for } 0 \le x \le 1, \\ 1 & \text{for } x > 1. \end{cases}$$


The power pdf and cdf for $\theta = 3$ are plotted in Figure 2.3. The power
distribution may have no natural application, but we will use it for
examples because the integration is so simple.


2.4. Mixed Case 


A mixture of the discrete and continuous cases may also be relevant. 
Let X = dollars spent in a year on car repairs, the experiment being 
drawing a family at random from the U.S. population. An appropriate 
model would allow for a mass point at X = 0, with a continuous distri- 
bution over positive values of X. 


Figure 2.3 Power distribution: pdf and cdf, θ = 3.


2.5. Functions of Random Variables


We now have in hand a stock of univariate distributions to draw upon. 
Once probabilities are initially assigned via f(x), we are committed to 
many other probabilities. Here are two trivial examples: 

(a) Suppose that X ~ Poisson(1), that is, the random variable X has
the Poisson distribution with parameter 1. Let $A = \{2 < X \le 4\}$. To
find Pr(A), write $A = A_1 \cup A_2$, where $A_1 = \{X = 3\}$, $A_2 = \{X = 4\}$.
Because $A_1 \cap A_2 = \emptyset$, it follows that $\Pr(A) = \Pr(A_1) + \Pr(A_2) = f(3) +
f(4)$. But $f(x) = e^{-1}/x! = 1/(e\,x!)$, so f(3) = 0.0613, f(4) = 0.0153, and
Pr(A) = 0.0766.
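
The arithmetic in example (a) can be reproduced in a few lines. This is a sketch added for illustration, assuming the usual Poisson pmf f(x) = e^{-λ} λ^x / x!; the function name `poisson_pmf` is hypothetical.

```python
import math

def poisson_pmf(x, lam):
    """Poisson pmf: exp(-lam) * lam**x / x! for x = 0, 1, 2, ..."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

# A = {2 < X <= 4} is the union of the disjoint events {X = 3} and {X = 4}.
p3 = poisson_pmf(3, 1.0)
p4 = poisson_pmf(4, 1.0)
pr_A = p3 + p4

print(f"f(3) = {p3:.4f}, f(4) = {p4:.4f}, Pr(A) = {pr_A:.4f}")
```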

(b) Suppose that X ~ standard normal. Let A = {|X| < 2}. To find
Pr(A), the calculation is


Pr(|X| < 2) = Pr(-2 < X < 2) = F(2) - F(-2) = F(2) - [1 - F(2)]
= 2 F(2) - 1 = 0.954,


using Table A.1 to get F(2) = 0.977.

Now suppose Y = h(X) is a (single-valued) function of X. Let B be any 
event that is defined in terms of Y. Then B can be translated into an 
event defined in terms of X, so we can deduce Pr(B). Indeed we can 
deduce the probability distribution of the random variable Y. 

We illustrate the procedure with a few examples. 

(1) Suppose that the pmf of X is


$$f(x) = \begin{cases} 1/8 & \text{for } x = -1, \\ 2/8 & \text{for } x = 0, \\ 5/8 & \text{for } x = 1, \end{cases}$$


with f(x) = 0 elsewhere, and suppose that $Y = X^2$. The possible outcomes
for Y are 0 (which occurs iff X = 0) and 1 (which occurs iff X = -1 or
X = 1). Now Pr(Y = 0) = Pr(X = 0) = 2/8, and Pr(Y = 1) = Pr(X = -1
or X = 1) = Pr(X = -1) + Pr(X = 1) = 1/8 + 5/8 = 6/8. So the pmf of
Y is


$$g(y) = \begin{cases} 1/4 & \text{for } y = 0, \\ 3/4 & \text{for } y = 1, \end{cases}$$


with g(y) = 0 elsewhere.
(2) Suppose that X ~ standard normal and that


$$Y = \begin{cases} 1 & \text{if } X < -1, \\ 2 & \text{if } -1 \le X \le 2, \\ 3 & \text{if } 2 < X. \end{cases}$$


Then Pr(Y = 1) = F(-1), Pr(Y = 2) = F(2) - F(-1), Pr(Y = 3) = 1 -
F(2), where F(.) denotes the standard normal cdf. So, referring to Table
A.1, the pmf of Y is


$$g(y) = \begin{cases} 0.159 & \text{for } y = 1, \\ 0.818 & \text{for } y = 2, \\ 0.023 & \text{for } y = 3, \end{cases}$$


with g(y) = 0 elsewhere.

(3) Suppose that X ~ rectangular on the interval [-1, 1] and that
$Y = X^2$. Now the pdf of X is f(x) = 1/2 for $-1 \le x \le 1$, with f(x) = 0
elsewhere. It may be tempting to say that $\Pr(Y = y) = \Pr(X = -\sqrt{y}) +
\Pr(X = \sqrt{y})$, but this will not help since all three events have zero
probability. Instead, we proceed via cdf's. The cdf of X is


$$F(x) = \begin{cases} 0 & \text{for } x < -1, \\ (1 + x)/2 & \text{for } -1 \le x \le 1, \\ 1 & \text{for } x > 1. \end{cases}$$


Let $G(y) = \Pr(Y \le y)$ be the cdf of Y. Clearly Y is confined to the interval
[0, 1], so G(y) = 0 for y < 0. For $y \ge 0$,


$$G(y) = \Pr(Y \le y) = \Pr(X^2 \le y) = \Pr(-\sqrt{y} \le X \le \sqrt{y}) = F(\sqrt{y}) - F(-\sqrt{y}).$$


Now for $0 \le y \le 1$,

$$F(\sqrt{y}) - F(-\sqrt{y}) = (1 + \sqrt{y})/2 - (1 - \sqrt{y})/2 = \sqrt{y},$$

while for y > 1, $F(\sqrt{y}) - F(-\sqrt{y}) = 1 - 0 = 1$. So the cdf of Y is

$$G(y) = \begin{cases} 0 & \text{for } y \le 0, \\ \sqrt{y} & \text{for } 0 < y \le 1, \\ 1 & \text{for } y > 1. \end{cases}$$

Figure 2.4 Distribution of a function.


Finally, we get the pdf of Y by differentiating its cdf:


$$g(y) = \partial G(y)/\partial y = \begin{cases} 0 & \text{for } y \le 0, \\ 1/(2\sqrt{y}) & \text{for } 0 < y \le 1, \\ 0 & \text{for } y > 1. \end{cases}$$


Plotting the pdf as in Figure 2.4, we see a curve that slopes downward
over the unit interval; it runs along the horizontal axis elsewhere.
The shape of the pdf of Y may not have been anticipated from
inspecting the rectangular shape of the pdf of X and the parabolic shape
of the function $Y = X^2$.
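
The cdf approach of example (3) can be traced numerically. The following sketch is an illustration only (the function names and the finite-difference step are assumptions): it builds G(y) from the cdf of X exactly as in the derivation, then compares a numerical derivative of G with the closed-form pdf 1/(2√y).

```python
import math

def F(x):
    """cdf of X, rectangular on [-1, 1]."""
    if x < -1.0:
        return 0.0
    if x > 1.0:
        return 1.0
    return (1.0 + x) / 2.0

def G(y):
    """cdf of Y = X^2, built as F(sqrt(y)) - F(-sqrt(y)) for y >= 0."""
    if y < 0.0:
        return 0.0
    root = math.sqrt(y)
    return F(root) - F(-root)

def g(y):
    """Closed-form pdf of Y: 1 / (2 * sqrt(y)) on (0, 1], zero elsewhere."""
    return 1.0 / (2.0 * math.sqrt(y)) if 0.0 < y <= 1.0 else 0.0

def numeric_derivative(fn, y, h=1e-6):
    """Central finite-difference approximation to fn'(y)."""
    return (fn(y + h) - fn(y - h)) / (2.0 * h)

for y in (0.04, 0.25, 0.81):
    print(f"y = {y}: G(y) = {G(y):.4f}, dG/dy ~ {numeric_derivative(G, y):.4f}, g(y) = {g(y):.4f}")
```

The pdf is large near zero and falls as y rises, which matches the downward-sloping curve described above.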

(4) Linear Functions. Suppose that the continuous random variable X
has pdf f(x) and cdf F(x) and that Y = a + bX is a linear function of X,
with a and b > 0 being constants. To find the pdf of Y, we follow the
approach of (3). The cdf is


$$G(y) = \Pr(Y \le y) = \Pr(a + bX \le y) = \Pr[X \le (y - a)/b] = F[(y - a)/b] = F(x),$$

where x = (y - a)/b. So the pdf of Y is


$$g(y) = G'(y) = [\partial F(x)/\partial x](\partial x/\partial y) = f(x)/b = (1/b)\, f[(y - a)/b].$$


If b had been negative, then the 1/b would have been replaced by
1/(-b).

As a special case, suppose that X ~ standard normal and that Y =
a + bX, with b > 0. The pdf of X is $f(x) = (2\pi)^{-1/2} \exp(-x^2/2)$, so the
pdf of Y is


$$g(y) = (1/b)\,(2\pi)^{-1/2} \exp(-[(y - a)/b]^2/2).$$


This specifies the (general) normal distribution with parameters a and b,
which will be discussed in Chapter 7.
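
A small sketch may make the linear-function rule concrete. It is not from the original text: the helper `linear_transform_pdf` and the values a = 1, b = 2 are assumptions for illustration. The factor 1/|b| covers both the b > 0 case, 1/b, and the b < 0 case, 1/(-b). The code checks that the transformed density still integrates to (approximately) one.

```python
import math

def std_normal_pdf(x):
    """Standard normal pdf."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def linear_transform_pdf(f, a, b):
    """Return the pdf of Y = a + b*X given the pdf f of X (b must be nonzero)."""
    if b == 0:
        raise ValueError("b must be nonzero")
    return lambda y: f((y - a) / b) / abs(b)

def midpoint_integral(f, lo, hi, steps=40_000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

a, b = 1.0, 2.0
g = linear_transform_pdf(std_normal_pdf, a, b)

# Integrate over a wide interval centered at a (ten multiples of b each side).
total = midpoint_integral(g, a - 10.0 * b, a + 10.0 * b)
print(f"g(a) = {g(a):.4f}, integral of g = {total:.6f}")
```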


Exercises 


2.1 One ball will be drawn from a jar containing white, blue, yellow, 
and green balls. The probability that the ball drawn will be green is 
0.25, and the probability that the ball drawn will be yellow is 0.14. What 
is the probability that the ball drawn will be white or blue? 


2.2 The probability that family A will buy a car is 0.7, the probability 
that family B will buy a car is 0.5, and the probability that both families 
will buy is 0.35. Find the probability of each of these events: 


(a) Neither buys. 
(b) Only one buys. 
(c) At least one buys. 


2.3 In a city, 65% of the families subscribe to the morning paper, 50% 
subscribe to the afternoon paper, and 80% subscribe to at least one of 
the two papers. What proportion of the families subscribes to both 
papers? 


2.4 Consider two events A and B such that Pr(A) = 1/2 and Pr(B) =
1/3. Find $\Pr(B \cap \text{not } A)$ for each of these cases:


(a) A and B are disjoint.
(b) $B \subset A$.
(c) $\Pr(B \cap A) = 1/7$.


2.5 Consider two events A and B with Pr(A) = 0.5 and Pr(B) = 0.7.
Determine the minimum and maximum values of $\Pr(A \cap B)$ and the
conditions under which each is attained.


2.6 Consider an experiment in which a loaded coin (probability of 
heads is 5/6) is tossed once and a fair die is rolled once. 


(a) Specify the sample space for the experiment. 
(b) What is the probability that the coin will be heads and the number 
that appears on the die will be even? 


2.7 Discuss the appropriateness of the Bernoulli distribution as a 
model for unemployment. 


2.8 Consider these seven events: 


А = {Х= 1) В = {Х = 9} C 
D = {Xis even} E={1<X< 5} F-CUD 
G-CnD 


Consider also these three discrete probability distributions: 


(a) Bernoulli with parameter p = 0.4. 
(b) Discrete uniform with parameter N = 9.
(c) Binomial with parameters n = 2, p = 0.4. 


For each distribution calculate the probability of each event. Treat 0 as 
an even number. 


2.9 Consider these three events: 
A = {X = 2},   B = {X = 3},   C = {X = 4}.
Consider also these two discrete probability distributions: 


(a) Binomial with parameters n = 4, p = 0.6. 
(b) Poisson with parameter λ = 1.5.


For each distribution calculate the probability of each event. 
2.10 Consider these five events: 


А ={0=Х = 1/9} В = {Х = 1/9} С = (X = 1/2} 
р={14=Х=34р E={1<X <2} 
Consider also these two continuous probability distributions:


(a) Rectangular on the interval [0, 2].
(b) Power on the interval [0, 1] with parameter θ = 2.


For each distribution calculate the probability of each event. 
2.11 Consider these six events: 


A-(0zXzl) В={0=Х = 9} С= {11 = Х = 3} 
р={-1=Х5=1} Е={-2=Х=9} F={-3<=X< 3} 


Consider also these three continuous distributions: 
(a) Exponential with parameter λ = 2.


(b) Standard normal. 
(c) Standard logistic. 


For each distribution calculate the probability of each event. 


2.12 For each of the following, use the cdf approach to obtain the 
pdf of Y: 


(a) X distributed exponential, Y = 2X.

(b) X distributed rectangular on (0, 1), Y = -log(X). (Note: In this
book, "log" always denotes natural logarithm.)

(c) X distributed standard normal, $Y = X^2$.


3 Expectations: Univariate Case 


3.1. Expectations 


Our discussion of empirical frequency distributions in Chapter 1 placed
considerable emphasis on the mean, $m_X = \sum_i x_i p(x_i)$. In probability
distributions, the mean again plays an important role. The name for mean
in a probability distribution is expectation, or expected value.

Suppose that the random variable X has pmf or pdf f(x), a situation
that we write as X ~ f(x). Then the expectation of X is defined as


$$E(X) = \begin{cases} \sum_i x_i f(x_i) & \text{in the discrete case,} \\ \int_{-\infty}^{\infty} x f(x)\,dx & \text{in the continuous case.} \end{cases}$$


Let Z = h(X) be a function of X. To obtain the expectation of Z, one
might first deduce g(z), the pmf or pdf of Z, and apply the definition:
$E(Z) = \sum_j z_j g(z_j)$ or $\int_{-\infty}^{\infty} z g(z)\,dz$. Equivalently, one can get the expectation
of Z = h(X) as


$$E(Z) = \begin{cases} \sum_i h(x_i) f(x_i) & \text{in the discrete case,} \\ \int_{-\infty}^{\infty} h(x) f(x)\,dx & \text{in the continuous case.} \end{cases}$$


The symbol $\mu$ is also used to denote an expectation, so we will write
$\mu_X = E(X)$ and $\mu_Z = E(Z)$.


Example. Suppose X ~ Bernoulli(p). This is a discrete distribution
with f(0) = 1 - p, f(1) = p, where $0 \le p \le 1$. We calculate E(X) =
0(1 - p) + 1(p) = p. Also, let $Z = X^2$. Then $E(Z) = 0^2(1 - p) + 1^2 p =
p$. Observe that E(X) = p, so (unless p = 0 or p = 1) the expected value
of X will be a value of X that never occurs. Observe also that $p =
E(X^2) \ne [E(X)]^2 = p^2$, so in general the expectation of a function is not
the function of the expectation: $E[h(X)] \ne h[E(X)] = h(\mu_X)$.


Example. Suppose X ~ rectangular on [0, 2]. This is a continuous
distribution with f(x) = 1/2 for $0 \le x \le 2$, f(x) = 0 elsewhere. We
calculate


$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_0^2 (x/2)\,dx = (1/2) \int_0^2 x\,dx.$$


But $\int x\,dx = x^2/2$ and $(x^2/2)\big|_0^2 = 2$. So E(X) = (1/2)2 = 1. Also, let $Z =
X^2$. Then


$$E(Z) = \int_0^2 (x^2/2)\,dx = (1/2) \int_0^2 x^2\,dx.$$


But $\int x^2\,dx = x^3/3$ and $(x^3/3)\big|_0^2 = 8/3$. So E(Z) = (1/2)(8/3) = 4/3. Observe
again that $E(X^2) = 4/3 \ne 1 = [E(X)]^2$.


Caution: There are distributions whose expectations are infinite, or 
do not exist, but we ignore those possibilities throughout this book. 
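
The two examples above can be checked numerically. The sketch below is an added illustration, with helper names of my own choosing; it applies the rule E[h(X)] = ∫ h(x) f(x) dx for the rectangular distribution on [0, 2] using a midpoint rule, and the discrete rule for a Bernoulli variable with an assumed p = 0.3.

```python
def midpoint_integral(f, lo, hi, steps=2_000):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

def expectation_continuous(h, pdf, lo, hi):
    """E[h(X)] for a continuous X whose pdf is zero outside [lo, hi]."""
    return midpoint_integral(lambda x: h(x) * pdf(x), lo, hi)

def expectation_discrete(h, pmf):
    """E[h(X)] for a discrete X given as a {value: probability} dict."""
    return sum(h(x) * p for x, p in pmf.items())

# Rectangular on [0, 2]: f(x) = 1/2 on the interval.
rect_pdf = lambda x: 0.5 if 0.0 <= x <= 2.0 else 0.0
EX = expectation_continuous(lambda x: x, rect_pdf, 0.0, 2.0)
EX2 = expectation_continuous(lambda x: x * x, rect_pdf, 0.0, 2.0)
print(f"rectangular: E(X) = {EX:.4f}, E(X^2) = {EX2:.4f}, [E(X)]^2 = {EX ** 2:.4f}")

# Bernoulli with an illustrative p = 0.3: E(X) = E(X^2) = p, but [E(X)]^2 = p^2.
p = 0.3
bern = {0: 1.0 - p, 1: p}
bEX = expectation_discrete(lambda x: x, bern)
bEX2 = expectation_discrete(lambda x: x * x, bern)
print(f"Bernoulli:   E(X) = {bEX:.2f}, E(X^2) = {bEX2:.2f}, [E(X)]^2 = {bEX ** 2:.2f}")
```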


3.2. Moments 


The moments of X are the expectations of integer powers of X, or of
$X^* = X - \mu_X$. For nonnegative integers r,


$E(X^r)$ is the rth raw moment, or moment about zero, of X,

$E(X^{*r})$ is the rth central moment, or moment about the mean, of X.

Each of the moments provides some information about the distribution.
Taking r = 1, we have $E(X) = \mu$ and $E(X^*) = 0$. Taking r = 2, we have
the second raw moment $E(X^2)$, and the variance:


$$E(X^{*2}) = V(X) = \sigma^2.$$
Table 3.1 Expectations and variances of illustrative distributions.


Distributions                              E(X)         V(X)

Discrete
(1) Bernoulli, parameter p                 p            p(1 - p)
(2) Discrete uniform, parameter N          (N + 1)/2    (N² - 1)/12
(3) Binomial, parameters n, p              np           np(1 - p)
(4) Poisson, parameter λ                   λ            λ

Continuous
(1) Rectangular on the interval [a, b]     (a + b)/2    (b - a)²/12
(2) Exponential, parameter λ               1/λ          1/λ²
(3) Standard normal                        0            1
(4) Standard logistic                      0            π²/3
(5) Power on [0, 1], parameter θ           θ/(1 + θ)    θ/[(1 + θ)²(2 + θ)]


The variance of a random variable X is the expectation of the squared 
deviation of X from its expectation. It serves as a measure of the spread 
of the distribution. If V(X) = 0, then X is a constant, and conversely. 

For the distributions introduced in Chapter 1, Table 3.1 gives the 
expectations and variances. 
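
Two of the discrete entries in Table 3.1 can be confirmed with exact rational arithmetic. This sketch is illustrative only. It assumes the discrete uniform distribution places mass 1/N on each of x = 1, ..., N (consistent with the tabulated mean (N + 1)/2), and the helper names are hypothetical.

```python
from fractions import Fraction
from math import comb

def mean_and_variance(pmf):
    """Return (E(X), V(X)) for a pmf given as a {value: probability} dict."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    return mu, var

def discrete_uniform(N):
    """Assumed form: mass 1/N on each of 1, 2, ..., N."""
    return {x: Fraction(1, N) for x in range(1, N + 1)}

def binomial(n, p):
    """Binomial pmf with n trials and success probability p (a Fraction)."""
    return {x: comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)}

N = 9
mu_u, var_u = mean_and_variance(discrete_uniform(N))
print("discrete uniform:", mu_u, var_u, "table:", Fraction(N + 1, 2), Fraction(N * N - 1, 12))

n, p = 4, Fraction(3, 5)
mu_b, var_b = mean_and_variance(binomial(n, p))
print("binomial:", mu_b, var_b, "table:", n * p, n * p * (1 - p))
```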


3.3. Theorems on Expectations 


Several useful theorems on expectations and moments are easy to estab- 
lish: 


T1. LINEAR FUNCTIONS. For linear functions, the expectation 
of the function is the function of the expectation, and the variance of 
the function is the slope squared multiplied by the variance. That is, if 
Z = а + bX where a and b are constants, then 


$$E(Z) = a + bE(X), \qquad V(Z) = b^2 V(X).$$

Proof (for continuous case).


$$E(Z) = \int (a + bx) f(x)\,dx = \int a f(x)\,dx + \int b x f(x)\,dx = a \int f(x)\,dx + b \int x f(x)\,dx = a \cdot 1 + bE(X) = a + bE(X).$$


Then $Z^* = Z - E(Z) = a + bX - [a + bE(X)] = bX - bE(X) = bX^*$, so
$V(Z) = E(Z^{*2}) = E(b^2 X^{*2}) = b^2 E(X^{*2}) = b^2 V(X)$. ■


(Note: The symbol $\int$ is used in this chapter as shorthand for $\int_{-\infty}^{\infty}$.) A
parallel proof applies to the discrete case.


T2. VARIANCE. The variance of a random variable is equal to the 
expectation of its square minus the square of its expectation. That is, 


$$V(X) = E(X^2) - E^2(X).$$

Proof. Write $V(X) = E(X^{*2})$, where $X^* = X - E(X)$. Now $X^{*2} = X^2 +
E^2(X) - 2E(X)X$, so using T1 extended to handle two variables gives
$E(X^{*2}) = E(X^2) + E^2(X) - 2E(X)E(X) = E(X^2) - E^2(X)$. ■


(Note: $E^2(X)$ denotes $[E(X)]^2$.) Because $E(X^{*2}) \ge 0$, we can conclude that
$E(X^2) \ge E^2(X)$, with equality iff V(X) = 0, that is, iff X is a constant.


T3. MEAN SQUARED ERROR. Let c be any constant. Then the
mean squared error of a random variable about the point c is

$$E(X - c)^2 = \sigma^2 + (c - \mu)^2.$$

Proof. Write $(X - c) = (X - \mu) - (c - \mu) = X^* - (c - \mu)$. So
$(X - c)^2 = X^{*2} + (c - \mu)^2 - 2(c - \mu)X^*$. Then using T1 gives

$$E(X - c)^2 = E(X^{*2}) + (c - \mu)^2 - 2(c - \mu)E(X^*).$$

But $E(X^*) = 0$ and $E(X^{*2}) = \sigma^2$. ■


T4. MINIMUM MEAN SQUARED ERROR. The value of c that
minimizes $E(X - c)^2$ is $c = \mu$.


Proof. From T3, $E(X - c)^2 = \sigma^2 + (c - \mu)^2$. But $(c - \mu)^2 \ge 0$ with
equality iff $c - \mu = 0$, that is, iff $c = \mu$. ■
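
T3 and T4 can be illustrated with exact arithmetic on a small discrete distribution. The sketch below is not part of the original text. It reuses the three-point pmf from Section 2.5 (mass 1/8, 2/8, 5/8 at -1, 0, 1), and the candidate values of c are arbitrary choices.

```python
from fractions import Fraction

# Three-point pmf: mass 1/8 at -1, 2/8 at 0, 5/8 at 1.
pmf = {-1: Fraction(1, 8), 0: Fraction(2, 8), 1: Fraction(5, 8)}

def expectation(h, pmf):
    """E[h(X)] for a discrete pmf given as a {value: probability} dict."""
    return sum(h(x) * p for x, p in pmf.items())

mu = expectation(lambda x: x, pmf)
sigma2 = expectation(lambda x: (x - mu) ** 2, pmf)

def mse(c):
    """Mean squared error of X about the point c, computed directly."""
    return expectation(lambda x: (x - c) ** 2, pmf)

# Arbitrary candidate predictors, including the mean itself.
candidates = [Fraction(k, 4) for k in range(-8, 9)]
for c in (Fraction(-1), mu, Fraction(2)):
    print(f"c = {c}: direct MSE = {mse(c)}, sigma^2 + (c - mu)^2 = {sigma2 + (c - mu) ** 2}")

best = min(candidates, key=mse)
print("minimizing c among candidates:", best, "(mu =", mu, ")")
```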
3.4. Prediction


Thus far, the expected value of a random variable is simply the mean 
in its probability distribution. We now offer a practical interpretation. 

Suppose that the random variable X has a known pmf or pdf f(x). A
single draw will be made from the distribution of X. You are asked to
forecast, predict, or guess the outcome, using a constant c as the
predictor. What is the best guess, that is, what is the best predictor? Suppose
that your criterion for good prediction is minimum mean squared
forecast error. Then you will choose c to minimize $E(U^2)$, where U =
X - c is the forecast error. By T4, the solution to your problem is $c = \mu$.
The best predictor of a random drawing from a known probability
distribution is the expected value of the random variable, when the
criterion for predictive success is minimum mean squared forecast error.

When you use $\mu$ as the predictor, the forecast error is $X - \mu = \epsilon$,
say, so the expected forecast error is $E(\epsilon) = 0$, and the expected squared
forecast error is $E(\epsilon^2) = E(X - \mu)^2 = \sigma^2$. A predictor for which the
expected forecast error is zero is called an unbiased predictor. So $\mu$ is an
unbiased predictor, but there are many unbiased predictors. Let $Z =
\mu + W$ where W is any random variable with E(W) = 0. Then Z is
also an unbiased predictor of X, because $E(X - Z) = E(X - \mu - W) =
E(X - \mu) - E(W) = 0 - 0 = 0$. But (unless W is correlated with X),
$E(X - Z)^2 = \sigma^2 + V(W) \ge \sigma^2$.

Different criteria for predictive success lead to different choices: It
can be shown that to minimize E(|U|), you should choose c = median(X),
and that to maximize Pr(U = 0), you should (in the discrete case!)
choose c = mode(X). In econometrics, it is customary to adopt the
minimum mean squared error criterion; as we have seen, this leads to
the expected value as the best predictor. This is true even when, as in
a Bernoulli distribution, the expected value is not a possible value of X.


3.5. Expectations and Probabilities 


Any probability can be interpreted as an expectation. Define the variable 
Z which is equal to 1 if event A occurs, and equal to zero if event A 
does not occur. Then it is easy to see that Pr(A) = E(Z). 
How much information about the probability distribution of a random
variable X is provided by the expectation and variance of X? There are
three useful theorems here.


MARKOV INEQUALITY. If Y is a nonnegative random variable,
that is, if Pr(Y < 0) = 0, and k is any positive constant, then $\Pr(Y \ge k) \le
E(Y)/k$.


Proof (for continuous case). Write


$$E(Y) = \int_0^{\infty} y f(y)\,dy = \int_0^k y f(y)\,dy + \int_k^{\infty} y f(y)\,dy = a + b,$$


say. Now $a \ge 0$, so $E(Y) \ge b$. Also $b \ge k \int_k^{\infty} f(y)\,dy = k \Pr(Y \ge k)$. So
$E(Y) \ge k \Pr(Y \ge k)$. ■


CHEBYSHEV INEQUALITY #1. If X is a random variable, c is
any constant, and d is any positive constant, then $\Pr(|X - c| \ge d) \le
E(X - c)^2/d^2$.


Proof. Let $Y = (X - c)^2$, so Y is a nonnegative random variable, and
$|X - c| \ge d \Leftrightarrow Y \ge d^2$. Let $k = d^2$, and apply the Markov Inequality to
get $E(Y) \ge d^2 \Pr(Y \ge d^2)$. ■


CHEBYSHEV INEQUALITY #2. If X is a random variable with
expectation $E(X) = \mu$ and variance $V(X) = \sigma^2$, and d is any positive
constant, then $\Pr(|X - \mu| \ge d) \le \sigma^2/d^2$.


Proof. Apply Chebyshev Inequality #1 with $c = \mu$. ■
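
To get a feel for how tight these bounds are, the sketch below compares them with exact probabilities for the exponential distribution with λ = 2, whose mean 1/λ and variance 1/λ² are taken from Table 3.1. It is an added illustration: the thresholds k = 1 and d = 2σ are chosen for convenience, and the variable names are assumptions.

```python
import math

lam = 2.0
mu = 1.0 / lam           # E(X) for the exponential distribution
var = 1.0 / lam ** 2     # V(X) for the exponential distribution
sigma = math.sqrt(var)

def exp_cdf(x):
    """Exponential cdf with parameter lam: 1 - exp(-lam * x) for x > 0."""
    return 1.0 - math.exp(-lam * x) if x > 0 else 0.0

# Markov Inequality: Pr(X >= k) <= E(X)/k, valid because X is nonnegative.
k = 1.0
markov_bound = mu / k
markov_exact = 1.0 - exp_cdf(k)

# Chebyshev Inequality #2 with d = 2*sigma: Pr(|X - mu| >= d) <= sigma^2/d^2.
d = 2.0 * sigma
chebyshev_bound = var / d ** 2
chebyshev_exact = exp_cdf(mu - d) + (1.0 - exp_cdf(mu + d))

print(f"Markov:    bound = {markov_bound:.4f}, exact = {markov_exact:.4f}")
print(f"Chebyshev: bound = {chebyshev_bound:.4f}, exact = {chebyshev_exact:.4f}")
```

In this case both bounds hold with considerable slack, which is typical: the inequalities use only the mean and variance, not the full shape of the distribution.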


How much information about the expectation of a function is
provided by the expectation of a random variable? As we have seen in T1,
for linear functions the expectation of the function is the function of
the expectation. But if Y = h(X) is nonlinear, then in general $E(Y) \ne
h[E(X)]$: the direction of the inequality may depend on the distribution
of X. For certain functions, we can be more definite.

Let $E(X) = \mu$, Y = h(X), $\partial Y/\partial X = h'(X)$. Let Z be the tangent line to
h(X) at the point $\mu$, that is, $Z = h(\mu) + h'(\mu)(X - \mu)$. Since Z is linear
in X, while $h(\mu)$ and $h'(\mu)$ are constants, we have


$$E(Z) = h(\mu) + h'(\mu) E(X - \mu) = h(\mu).$$


If $Y = h(X) \le Z$ everywhere, then regardless of the distribution of X,
we are assured that $E(Y) \le E(Z) = h(\mu)$. Now, a concave function lies
everywhere below its tangent line, no matter where the tangent line is
drawn. Thus we have shown


JENSEN'S INEQUALITY. If Y = h(X) is concave and $E(X) = \mu$,
then $E(Y) \le h(\mu)$.


For example, the logarithmic function is concave, so $E[\log(X)] \le
\log[E(X)]$ regardless of the distribution of X. Similarly, if Y = h(X) is
convex, so that it lies everywhere above its tangent line, then $E(Y) \ge
h(\mu)$. For example, the square function is convex, so $E(X^2) \ge [E(X)]^2$
regardless of the distribution of X, as we have already seen.
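
Jensen's Inequality is easy to see on a small example. The three-point distribution below is made up for illustration and does not appear in the text; the sketch evaluates both sides of the inequality for the concave function log and the convex function x².

```python
import math

# A made-up positive-valued distribution: values 1, 2, 4 with the given masses.
pmf = {1.0: 0.25, 2.0: 0.50, 4.0: 0.25}

def expectation(h, pmf):
    """E[h(X)] for a discrete pmf given as a {value: probability} dict."""
    return sum(h(x) * p for x, p in pmf.items())

mu = expectation(lambda x: x, pmf)

# Concave case: E[log(X)] <= log(E(X)).
e_log = expectation(math.log, pmf)
log_mu = math.log(mu)

# Convex case: E(X^2) >= [E(X)]^2.
e_sq = expectation(lambda x: x * x, pmf)
mu_sq = mu ** 2

print(f"E[log X] = {e_log:.4f} <= log E(X) = {log_mu:.4f}")
print(f"E(X^2)   = {e_sq:.4f} >= [E(X)]^2 = {mu_sq:.4f}")
```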


Exercises 


3.1 Verify the entries of expectations and variances in Table 3.1,
except those for the standard normal and standard logistic. Note: The
following definite integral is well known:


$$\int_0^{\infty} t^n e^{-at}\,dt = n!/a^{n+1} \quad \text{for } a > 0 \text{ and } n \text{ positive integer.}$$


3.2 For each of the following distributions for the random variable 
X, calculate E(X) and V(X): 


(a) Discrete uniform, parameter N = 9.
(b) Binomial, parameters n = 2, p = 0.4.
(c) Binomial, parameters n = 4, p = 0.6.
(d) Poisson, parameter λ = 3/2.
(e) Rectangular on the interval [0, 2].
(f) Exponential, parameter λ = 2.
(g) Power on [0, 1], parameter θ = 2.


3.3 Suppose X has the power distribution on [0, 1] with parameter
θ = 5. Let $Z = 1/X^2$. Find E(Z), $E(Z^2)$, V(Z).


3.4 Suppose X has the exponential distribution with parameter λ =
4. Let Z = exp(X). Find E(Z), $E(Z^2)$, V(Z).


3.5 Suppose X has the rectangular distribution on the interval [0, 3].
Let Z = F(X) where F(.) is the cdf of X. Find E(Z).


3.6 Suppose X ~ f(x). Let Z = F(X) where F(.) is the cdf of X. Find
E(Z).


3.7 Let $A = \{X \ge 1\}$ and $B = \{|X - \mu| \ge 2\sigma\}$. Consider these three
distributions: (i) Rectangular on the interval [0, 2], (ii) Exponential with
parameter λ = 2, (iii) Power on [0, 1] with parameter θ = 3.


(a) For each distribution, use the Markov or Chebyshev Inequality to 
calculate an upper bound on Pr(A) and Pr(B). 

(b) For each distribution, use the appropriate cdf to calculate the 
exact Pr(A) and Pr(B). 

(c) Comment on the usefulness of the inequalities. 


4 Bivariate Probability Distributions 


4.1. Joint Distributions 


The focus in this book is on relations between variables, where the
relations are not deterministic ones. So we need more than one variable
in our probability distributions. We take up the bivariate case in detail.
Consider an experiment that has various distinct possible outcomes.
The outcomes are distinguished by the values of a pair of random
variables X, Y. The values they take on are labeled x, y. Each trial of the
experiment produces one value of the pair (x, y). We refer to the pair
(X, Y) as a random vector, a name that merely indicates a set of random
variables whose joint values are determined by the outcome of an
experiment. As in the univariate situation, we distinguish two cases.


Discrete Case 


In the discrete case, there are a finite, or countably infinite, number of 
distinct possible values for X, and also for Y, and thus for the pair (X, Y). 
So we can list the distinct possible pairs, say as the column and row 
headings of a two-way array. The points on the list are called mass points. 
There is a function f(x, y), called the joint probability mass function, or joint 
pmf, of the distribution. It must satisfy these requirements: $f(x, y) \ge 0$
everywhere, f(x, y) > 0 only at the mass points, and


$$\sum_i \sum_j f(x_i, y_j) = 1.$$


The joint pmf gives the basic assignment of probabilities via:

$$\Pr(X = x, Y = y) = f(x, y).$$


Probabilities of other events then follow in the usual fashion.

If we enter the f's in the cells of the two-way array, we have a table
that looks like an empirical joint frequency distribution, such as that for
savings rate and income. But now the entries in the table are
probabilities rather than frequencies, or if you like, they are frequencies in a
population rather than in a sample.


Example: Trinomial Distribution with parameters n, p, q. Here n is
a positive integer, $0 \le p \le 1$, $0 \le q \le 1$, and $p + q \le 1$. The joint pmf
is:


$$f(x, y) = \frac{n!}{x!\,y!\,(n - x - y)!}\; p^x q^y (1 - p - q)^{n - x - y},$$


for x = 0, 1, ..., n and y = 0, 1, ..., n - x, with f(x, y) = 0 otherwise.
This might be a relevant model for the labor force status of individuals
over n months, using a three-way breakdown of status for each month:
employed, unemployed, or not in labor force. The variables over an
n-month period would be X = number of months employed, Y = number
of months unemployed, and n - X - Y = number of months not in
labor force.


Continuous Case 


In the continuous case, there is a continuum of distinct possible out- 
comes for X and also for Y, and thus a two-dimensional continuum of 
possible outcomes for the pair (X, Y). There is a function f(x, y), called 
the joint probability density function, or joint pdf, of the distribution. It 
must satisfy these requirements: $f(x, y) \ge 0$ everywhere, and


$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dy\,dx = 1.$$


The joint pdf gives the basic assignment of probabilities as follows. For
any $a \le b$, $c \le d$:


$$\Pr(a \le X \le b,\; c \le Y \le d) = \int_a^b \int_c^d f(x, y)\,dy\,dx.$$


Probabilities of other events follow in the usual fashion. As in the
univariate case, the pdf does not give the probability of being at a point:
Pr(X = x, Y = y) = 0 everywhere, even where $f(x, y) \ne 0$.
Figure 4.1 Roof distribution: joint pdf.


Example: Roof Distribution. The joint pdf is

$$f(x, y) = (x + y) \quad \text{for } 0 \le x \le 1 \text{ and } 0 \le y \le 1,$$

with f(x, y) = 0 elsewhere. Clearly $f(x, y) \ge 0$ everywhere, and


$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dy\,dx = \int_0^1 \int_0^1 (x + y)\,dy\,dx = \int_0^1 (xy + y^2/2)\Big|_0^1\,dx$$

$$= \int_0^1 (x + 1/2)\,dx = (x^2/2 + x/2)\Big|_0^1 = 1/2 + 1/2 = 1.$$


So this is a legitimate joint pdf. The plot in Figure 4.1 accounts for the
name "roof distribution." We will use this joint pdf as an example
because the integration is simple.


The joint cumulative distribution function, or joint cdf, may be defined
as $F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(s, t)\,dt\,ds = \Pr(X \le x, Y \le y)$.
4.2. Marginal Distributions


We proceed to implications of the initial assignment of probabilities in 
a bivariate probability distribution. 


Discrete Case 


Let A = {X = x} and $A_j = \{X = x, Y = y_j\}$ for j = 1, 2, .... Recognizing
that $A = \cup_j A_j$ is a union of disjoint events, we calculate


$$\Pr(X = x) = \Pr(A) = \sum_j \Pr(A_j) = \sum_j f(x, y_j) = f_1(x),$$

say. This new function $f_1(x)$ is called the marginal pmf of X. Observe
that $f_1(x) \ge 0$ everywhere, that $f_1(x) > 0$ only for points in the list, and
that


$$\sum_i f_1(x_i) = \sum_i \Big[\sum_j f(x_i, y_j)\Big] = 1.$$


So $f_1(x)$ is a legitimate univariate probability mass function. Similarly,
$f_2(y) = \sum_i f(x_i, y) = \Pr(Y = y)$ is the marginal pmf of Y. The subscripts 1
and 2 merely distinguish the two functions.

Example. For the trinomial distribution, we can verify that


$$f_1(x) = \frac{n!}{x!\,(n - x)!}\; p^x (1 - p)^{n - x},$$


for x = 0, 1, ..., n, with $f_1(x) = 0$ otherwise. This is recognizable as
the pmf of a binomial distribution with parameters n, p.
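
The claim that the trinomial marginal is binomial can be verified exactly for particular parameter values. The sketch below is an added illustration, with n = 4, p = 1/2, q = 1/3 chosen arbitrarily and hypothetical helper names; it sums the joint pmf over y and compares the result with the binomial pmf using rational arithmetic.

```python
from fractions import Fraction
from math import comb, factorial

def trinomial_pmf(x, y, n, p, q):
    """Joint pmf of the trinomial distribution with parameters n, p, q."""
    if x < 0 or y < 0 or x + y > n:
        return Fraction(0)
    coef = factorial(n) // (factorial(x) * factorial(y) * factorial(n - x - y))
    return coef * p ** x * q ** y * (1 - p - q) ** (n - x - y)

def marginal_x(x, n, p, q):
    """Marginal pmf of X: sum the joint pmf over all possible y."""
    return sum(trinomial_pmf(x, y, n, p, q) for y in range(n - x + 1))

def binomial_pmf(x, n, p):
    """Binomial pmf with parameters n, p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p, q = 4, Fraction(1, 2), Fraction(1, 3)

# The joint pmf sums to one over all mass points.
total = sum(trinomial_pmf(x, y, n, p, q) for x in range(n + 1) for y in range(n - x + 1))
print("sum of joint pmf:", total)

for x in range(n + 1):
    print(x, marginal_x(x, n, p, q), binomial_pmf(x, n, p))
```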


Continuous Case 
Let $A = \{a \le X \le b\} = \{a \le X \le b,\; -\infty \le Y \le \infty\}$. Then

$$\Pr(a \le X \le b) = \Pr(A) = \int_a^b \int_{-\infty}^{\infty} f(x, y)\,dy\,dx = \int_a^b f_1(x)\,dx,$$


say, where


$$f_1(x) = \int_{-\infty}^{\infty} f(x, y)\,dy.$$

This new function $f_1(x)$ is called the marginal pdf of X. Observe that
$f_1(x) \ge 0$ everywhere (since it is the integral of a nonnegative function)
and that $\int_{-\infty}^{\infty} f_1(x)\,dx = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dy\,dx = 1$. So $f_1(x)$ is a legitimate
univariate pdf.


Similarly, $f_2(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$ is the marginal pdf of Y.


Example. For the roof distribution,

$$f_1(x) = \int_0^1 (x + y)\,dy = (xy + y^2/2)\Big|_0^1 = x + 1/2 \quad \text{for } 0 \le x \le 1,$$


with $f_1(x) = 0$ elsewhere. Figure 4.2 plots this marginal pdf.
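
The marginal pdf of the roof distribution can also be obtained by integrating out y numerically. This sketch is an added illustration with hypothetical helper names; because the integrand is linear in y, the midpoint rule reproduces x + 1/2 up to rounding.

```python
def roof_pdf(x, y):
    """Joint pdf of the roof distribution: x + y on the unit square."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0

def midpoint_integral(f, lo, hi, steps=200):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

def marginal_x(x):
    """Marginal pdf of X: integrate the joint pdf over y in [0, 1]."""
    return midpoint_integral(lambda y: roof_pdf(x, y), 0.0, 1.0)

for x in (0.0, 0.3, 1.0):
    print(f"x = {x}: numerical f1(x) = {marginal_x(x):.6f}, x + 1/2 = {x + 0.5:.6f}")

# The marginal pdf itself integrates to one.
total = midpoint_integral(marginal_x, 0.0, 1.0)
print(f"integral of f1 = {total:.6f}")
```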


4.3. Conditional Distributions 


We continue to draw implications of the initial assignment of probabil- 
ities in a bivariate distribution. 


Figure 4.2 Roof distribution: marginal pdf.


Discrete Case 


Let A = {X = x}, B = {Y = y}. In probability theory, if $\Pr(B) \ne 0$, one
defines the probability that A occurs given that B occurs as


$$\Pr(A|B) = \Pr(A \cap B)/\Pr(B).$$

Now in the discrete case, $\Pr(A \cap B) = f(x, y)$ and $\Pr(B) = f_2(y)$, so

$$\Pr(A|B) = f(x, y)/f_2(y) = g_1(x|y),$$


say, a function that is defined only for y such that $f_2(y) \ne 0$.
For any such y value, observe that $g_1(x|y)$ is a function of x alone,
$g_1(x|y) \ge 0$ (because $f(x, y) \ge 0$ and $f_2(y) > 0$), and


$$\sum_i g_1(x_i|y) = \sum_i f(x_i, y)/f_2(y) = \Big[\sum_i f(x_i, y)\Big] \Big/ f_2(y) = f_2(y)/f_2(y) = 1.$$


So for any y value with positive mass, $g_1(x|y)$ is a legitimate univariate
pmf for the random variable X. It is called the conditional pmf of X
given Y = y, and may be used in the ordinary way. For example, the
probability that the random variable X takes on the value x given that
the random variable Y takes on the value $y_j$ is $\Pr(X = x|Y = y_j) = g_1(x|y_j)$.

Running across j, there is a set of conditional probability distributions
of X: one distribution of X corresponding to each distinct possible value
of Y. Conditioning on Y may be viewed as partitioning the bivariate
population into subpopulations. Within each subpopulation, the value
of Y is constant while X varies.

Similarly, $g_2(y|x) = f(x, y)/f_1(x)$, defined for all x such that $f_1(x) \ne 0$, is
the conditional pmf of Y given X = x. There is one such distribution
for each distinct value of X. The pattern here is precisely the same as
in the empirical joint frequency distribution of income and savings rate.


Continuous Case 


In the continuous case, we proceed rather differently. For each y such
that $f_2(y) \ne 0$, define the function


$$g_1(x|y) = f(x, y)/f_2(y),$$

leaving $g_1(\cdot|y)$ undefined elsewhere. This $g_1(x|y)$ is called the conditional
pdf of X given Y = y. It is easy to confirm that $g_1(x|y) \ge 0$ everywhere
where defined, and that


$$\int_{-\infty}^{\infty} g_1(x|y)\,dx = \int_{-\infty}^{\infty} [f(x, y)/f_2(y)]\,dx = \Big[\int_{-\infty}^{\infty} f(x, y)\,dx\Big] \Big/ f_2(y) = f_2(y)/f_2(y) = 1.$$


So for any y value with positive density, $g_1(x|y)$ is a legitimate univariate
pdf for the random variable X.

There are an infinity of such conditional distributions, one for each
value of Y. Each of them can be used in the ordinary way. For example,


$$\Pr(a \le X \le b\,|\,Y = y) = \int_a^b g_1(x|y)\,dx.$$


Observe that we have succeeded in defining Pr(A|B) even though
Pr(B) = 0. This would be nonsense in the discrete case, but it is quite
meaningful in the continuous case, where zero probability events do
occur.

For a quite distinct example, suppose we want Pr(A|B), where $A =
\{a \le X \le b\}$ and $B = \{c \le Y \le d\}$. Provided that $\Pr(B) \ne 0$, we have


$$\Pr(A|B) = \Pr(A \cap B)/\Pr(B) = \Big[\int_a^b \int_c^d f(x, y)\,dy\,dx\Big] \Big/ \int_c^d f_2(y)\,dy.$$


Similarly the conditional pdf of Y given x, defined for all x such that
$f_1(x) \ne 0$, is $g_2(y|x) = f(x, y)/f_1(x)$.


Example. For the roof distribution, $g_2(y|x)$ is defined only for
$0 \leq x \leq 1$. There

$$g_2(y|x) = f(x, y)/f_1(x) = (x + y)/(x + 1/2) \quad \text{for } 0 \leq y \leq 1,$$

with $g_2(y|x) = 0$ elsewhere. Figure 4.3 plots this function for x = 0,
0.5, 1.
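
As a rough numerical check of this example, the sketch below evaluates the roof distribution's conditional pdf at x = 0, 0.5, 1 and confirms with a midpoint Riemann sum that each one integrates to 1 over the unit interval. The function names and the grid size are illustrative assumptions, not part of the text.

```python
# Roof distribution: f(x, y) = x + y on the unit square (from the text).

def joint_pdf(x, y):
    """Joint pdf f(x, y) of the roof distribution."""
    inside = 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0
    return x + y if inside else 0.0

def marginal_x(x):
    """Marginal pdf f1(x) = x + 1/2 for 0 <= x <= 1."""
    return x + 0.5 if 0.0 <= x <= 1.0 else 0.0

def cond_pdf_y_given_x(y, x):
    """Conditional pdf g2(y|x) = f(x, y) / f1(x)."""
    return joint_pdf(x, y) / marginal_x(x)

def integrate_midpoint(func, lo, hi, n=10_000):
    """Midpoint Riemann sum; n is an arbitrary illustrative grid size."""
    width = (hi - lo) / n
    return sum(func(lo + (k + 0.5) * width) for k in range(n)) * width

# Total probability under each conditional pdf plotted in Figure 4.3.
totals = {
    x: integrate_midpoint(lambda y: cond_pdf_y_given_x(y, x), 0.0, 1.0)
    for x in (0.0, 0.5, 1.0)
}
print(totals)
```

Because each conditional pdf is linear in y, the midpoint rule is exact here up to floating-point rounding, so the three totals should agree with 1 very closely.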


Mixed cases may arise in a bivariate population. For example, if Y = 
family income and X = number of persons in family, then a natural 
model would have X discrete and Y continuous. In such situations, the 
joint distribution is most conveniently specified in terms of the marginal 
pmf of the discrete variable and the conditional pdf of the continuous 
variable.

Figure 4.3 Roof distribution: conditional pdf's.


Exercises 


4.1 Curved-roof Distribution. Consider the continuous bivariate
probability distribution whose joint probability density function is

$$f(x, y) = 3(x^2 + y)/11 \quad \text{for } 0 \leq x \leq 2,\ 0 \leq y \leq 1,$$

with f(x, y) = 0 elsewhere. The plot of this pdf in Figure 4.4 accounts
for its name.


(a) Show that the marginal pdf of X, plotted in Figure 4.5, is

$$f_1(x) = 3(2x^2 + 1)/22 \quad \text{for } 0 \leq x \leq 2,$$

with $f_1(x) = 0$ elsewhere.

(b) Derive $f_2(y)$, the marginal pdf of Y.

(c) For $0 \leq x \leq 2$, derive $g_2(y|x)$, the conditional pdf of Y given X,
plotted in Figure 4.6 for x = 0, 1, 2.

Figure 4.4 Curved-roof distribution: joint pdf.

Figure 4.5 Curved-roof distribution: marginal pdf.

Figure 4.6 Curved-roof distribution: conditional pdf's.


4.2 For the curved-roof distribution, let $A = \{0 \leq Y \leq 1/2\}$. Calculate
Pr(A), and calculate Pr(A|x) for x = 0, 1, 2.


4.3 For the curved-roof distribution, derive $g_1(x|y)$, the conditional
pdf of X given Y.


5 Expectations: Bivariate Case 


5.1. Expectations 


Suppose that the random vector (X, Y) has joint pdf or pmf f(x, y). Let
Z = h(X, Y) be a scalar function of (X, Y). Then the expectation of the
random variable Z is defined as

$$E(Z) = \begin{cases} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y) f(x, y)\,dy\,dx & \text{in the continuous case,} \\ \sum_i \sum_j h(x_i, y_j) f(x_i, y_j) & \text{in the discrete case.} \end{cases}$$


If in fact Z is a function of only one of the two variables, then its 
expectation is also computable from the marginal distribution of that 
variable. That is, if $h(x, y) = h_1(x)$, then

$$E(Z) = \sum_i \sum_j h_1(x_i) f(x_i, y_j) = \sum_i h_1(x_i) \Big[\sum_j f(x_i, y_j)\Big] = \sum_i h_1(x_i) f_1(x_i).$$


(Note: Here and subsequently, we will usually report derivations for 
only one of the two cases, discrete and continuous. The understanding 
is that a parallel derivation applies to the other case.) 

The moments of the joint distribution are the expectations of certain 
functions of (X, Y), or of (X*, Y*), where $X^* = X - E(X)$, $Y^* = Y - E(Y)$.
For nonnegative integers r, s:

$E(X^r Y^s)$ is the (r, s)th raw moment, or moment about zero,
$E(X^{*r} Y^{*s})$ is the (r, s)th central moment, or moment about the mean.

In particular,

r = 1, s = 0: $E(X^1 Y^0) = E(X) = \mu_X$,
r = 2, s = 0: $E(X^{*2} Y^{*0}) = E[X - E(X)]^2 = V(X) = \sigma_X^2$,
r = 1, s = 1: $E(X^* Y^*) = E\{[X - E(X)][Y - E(Y)]\} = C(X, Y) = \sigma_{XY}$.


The last of these is called the covariance of X and Y. Thus the covariance 
of a pair of random variables is the expected cross-product of their 
deviations from their respective expectations. 

The standard deviation of a random variable is the square root of its
variance. So the standard deviation of X is $\sigma_X = \sqrt{\sigma_X^2}$, and
similarly for Y. The correlation coefficient, or simply correlation, of a
pair of random variables is the ratio of their covariance to the product
of their standard deviations. So the correlation coefficient of X and Y is

$$\rho = C(X, Y)/[\sqrt{V(X)} \sqrt{V(Y)}] = \sigma_{XY}/(\sigma_X \sigma_Y).$$


Some useful theorems are readily established:


T5. LINEAR FUNCTION. Suppose that Z = a + bX + cY, where
a, b, c are constants. Then

$$E(Z) = a + bE(X) + cE(Y),$$
$$V(Z) = b^2 V(X) + c^2 V(Y) + 2bc\,C(X, Y).$$

Proof. For the expectation,

$$\begin{aligned}
E(Z) &= \sum_i \sum_j (a + b x_i + c y_j) f(x_i, y_j) \\
&= a \sum_i \sum_j f(x_i, y_j) + b \sum_i x_i \Big[\sum_j f(x_i, y_j)\Big] + c \sum_j y_j \Big[\sum_i f(x_i, y_j)\Big] \\
&= a \cdot 1 + b \sum_i x_i f_1(x_i) + c \sum_j y_j f_2(y_j).
\end{aligned}$$

For the variance, $V(Z) = E(Z^{*2})$, where

$$Z^* = Z - E(Z) = b[X - E(X)] + c[Y - E(Y)] = bX^* + cY^*.$$

Expanding the square gives $Z^{*2}$ as a linear function of $X^{*2}$, $Y^{*2}$, and
$X^* Y^*$. Use the rule for expectation of a linear function, extended to
handle three variables. ∎


T6. PAIR OF LINEAR FUNCTIONS. Suppose that

$Z_1 = a_1 + b_1 X + c_1 Y$, where $a_1, b_1, c_1$ are constants,

$Z_2 = a_2 + b_2 X + c_2 Y$, where $a_2, b_2, c_2$ are constants.

Then $C(Z_1, Z_2) = b_1 b_2 V(X) + c_1 c_2 V(Y) + (b_1 c_2 + b_2 c_1) C(X, Y)$.

Proof. Extend the approach used for V(Z) in the proof of T5. ∎


T7. COVARIANCE AND VARIANCE.

$$C(X, Y) = E(XY) - E(X)E(Y) = C(Y, X),$$
$$V(X) = E(X^2) - E^2(X) = C(X, X).$$

Proof. Because $E(Y^*) = 0$,

$$\begin{aligned}
C(X, Y) &= E(X^* Y^*) = E\{[X - E(X)]Y^*\} = E(XY^*) - E(X)E(Y^*) \\
&= E(XY^*) = E\{X[Y - E(Y)]\} \\
&= E(XY) - E(X)E(Y).
\end{aligned}$$ ∎

Example. For the roof distribution introduced in Section 4.1,
these moments are calculated by integration over the joint pdf:

$$E(X) = E(Y) = 7/12, \quad E(X^2) = E(Y^2) = 5/12, \quad E(XY) = 1/3.$$

Then

$$V(X) = E(X^2) - E^2(X) = 5/12 - 49/144 = 11/144 = V(Y),$$
$$C(X, Y) = E(XY) - E(X)E(Y) = 1/3 - 49/144 = -1/144.$$
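
The moments in this example can be reproduced exactly with rational arithmetic. The sketch below assumes the roof pdf f(x, y) = x + y on the unit square and uses the closed form obtained by integrating each term of x^r y^s (x + y); the helper name `roof_raw_moment` is hypothetical.

```python
from fractions import Fraction

def roof_raw_moment(r, s):
    """E(X^r Y^s) for f(x, y) = x + y on the unit square.

    Integrating x^r y^s (x + y) term by term over [0,1] x [0,1] gives
    1/((r+2)(s+1)) + 1/((r+1)(s+2)).
    """
    return Fraction(1, (r + 2) * (s + 1)) + Fraction(1, (r + 1) * (s + 2))

mean_x = roof_raw_moment(1, 0)
mean_y = roof_raw_moment(0, 1)
var_x = roof_raw_moment(2, 0) - mean_x ** 2        # T7: V(X) = E(X^2) - E^2(X)
var_y = roof_raw_moment(0, 2) - mean_y ** 2
cov_xy = roof_raw_moment(1, 1) - mean_x * mean_y   # T7: C(X, Y) = E(XY) - E(X)E(Y)

print(mean_x, var_x, cov_xy)
```

The (0, 0) moment equals 1, which serves as a check that the density integrates to one.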


5.2. Conditional Expectations 


In a bivariate probability distribution, the conditional expectation of Y given
X = x is the counterpart of the sample conditional mean $m_{Y|x}$ that was
introduced in Chapter 1.


DEFINITION. Let the random vector (X, Y) have joint pdf
$f(x, y) = g_2(y|x) f_1(x)$, and let Z = h(X, Y) be a function of (X, Y). Then
the conditional expectation of the random variable Z, given X = x, is

$$E(Z|x) = \int_{-\infty}^{\infty} h(x, y)\, g_2(y|x)\,dy$$

in the continuous case, and similarly in the discrete case.


The symbol $E(\cdot|x)$ denotes an expectation taken in the distribution
$g_2(y|x)$, so $E(Z|x) = \mu_{Z|x}$ is just the expected value of $h_x(Y) = h(x, Y)$ in
a particular univariate distribution. If we then allow X to vary, we get a
set of conditional expectations, denoted collectively as $E(Z|X) = \mu_{Z|X}$.

To illustrate the concepts, consider some special cases. Here a, b, c
are constants, X is a random variable, and x is a particular value of that
variable. Given X = x, any function of X alone is constant. With
that in mind, the following results are immediate:


(i) Let Z = h(X). Then E(Z|x) = h(x).

(ii) Let $Z = h_1(X)Y$. Then $E(Z|x) = h_1(x)E(Y|x)$.

(iii) Let Z = a + bX + cY. Then E(Z|x) = a + bx + cE(Y|x).

(iv) Let Z = Y. Then $E(Z|x) = E(Y|x) = \mu_{Y|x}$, the conditional
expectation of Y given that X = x.

(v) Let $Z = (Y - \mu_{Y|x})$. Then $E(Z|x) = E(Y|x) - \mu_{Y|x} = 0$.

(vi) Let $Z = (Y - \mu_{Y|x})^2$. Then $E(Z|x) = V(Y|x) = \sigma_{Y|x}^2$, the conditional
variance of Y given that X = x.

(vii) Let $Z = (Y - \mu_Y)$. Then $E(Z|x) = E(Y|x) - \mu_Y = \mu_{Y|x} - \mu_Y$.

(viii) Let $Z = (Y - \mu_Y)^2$. Then $E(Z|x) = \sigma_{Y|x}^2 + (\mu_{Y|x} - \mu_Y)^2$.


Proof of (viii). Write $Y - \mu_Y = (Y - \mu_{Y|x}) + (\mu_{Y|x} - \mu_Y)$, so

$$(Y - \mu_Y)^2 = (Y - \mu_{Y|x})^2 + (\mu_{Y|x} - \mu_Y)^2 + 2(\mu_{Y|x} - \mu_Y)(Y - \mu_{Y|x}).$$

Take expectations conditional on X = x, using (v), (vi), and the
conditional constancy of $(\mu_{Y|x} - \mu_Y)$. ∎


Now allow X to vary, so that E(Z|X) is itself a random variable, taking
on the values E(Z|x). Several key theorems are easily established:


T8. LAW OF ITERATED EXPECTATIONS. The (marginal)
expectation of Z = h(X, Y) is the expectation of its conditional
expectations:

$$E(Z) = E_X[E(Z|X)].$$

(Note: The symbol $E_X$, read as "the expectation over X," is the
expectation taken in the marginal distribution of X. The subscript may be
omitted if there is no risk of confusion.)

Proof.

$$\begin{aligned}
E(Z) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y) f(x, y)\,dy\,dx \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y)[g_2(y|x) f_1(x)]\,dy\,dx \\
&= \int_{-\infty}^{\infty} \Big[\int_{-\infty}^{\infty} h(x, y)\, g_2(y|x)\,dy\Big] f_1(x)\,dx \\
&= \int_{-\infty}^{\infty} E(Z|x) f_1(x)\,dx = E_X[E(Z|X)].
\end{aligned}$$ ∎


T9. MARGINAL AND CONDITIONAL MEANS. The (mar- 
ginal) expectation of Y is equal to the expectation of its conditional 
expectations:

$$\mu_Y = E(Y) = E_X[E(Y|X)] = E(\mu_{Y|X}).$$


T10. ANALYSIS OF VARIANCE. The (marginal) variance of Y
is equal to the expectation of its conditional variances plus the variance
of its conditional expectations:

$$\sigma_Y^2 = V(Y) = E_X[V(Y|X)] + V_X[E(Y|X)] = E(\sigma_{Y|X}^2) + V(\mu_{Y|X}).$$

Proof. Write V(Y) = E(Z) where $Z = (Y - \mu_Y)^2$, and apply T8 to item
(viii) in the list above. The second term is
$E[(\mu_{Y|X} - \mu_Y)^2] = V(\mu_{Y|X})$, because $E(\mu_{Y|X}) = \mu_Y$ by T9. ∎


T11. EXPECTED PRODUCT. The expected product of X and Y 
is equal to the expected product of X and the conditional expectation
of Y given X:

$$E(XY) = E_X[X E(Y|X)] = E(X \mu_{Y|X}).$$


T12. COVARIANCE. The covariance of X and Y is equal to the
covariance of X and the conditional expectation of Y given X:

$$C(X, \mu_{Y|X}) = E(X \mu_{Y|X}) - E(X)E(\mu_{Y|X}) = E(XY) - E(X)E(Y) = C(X, Y).$$
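
Theorems T9, T10, and T12 are identities, so they can be checked on any discrete joint distribution. The sketch below does so on a small hypothetical pmf (the six probabilities are invented for illustration) using exact fractions; the helper names are assumptions as well.

```python
from fractions import Fraction as F

# Hypothetical joint pmf f(x, y); the numbers are illustrative only.
pmf = {
    (0, 0): F(1, 10), (0, 1): F(2, 10),
    (1, 0): F(3, 10), (1, 2): F(1, 10),
    (2, 1): F(1, 10), (2, 2): F(2, 10),
}

def expect(func):
    """E[func(X, Y)] taken in the joint distribution."""
    return sum(func(x, y) * p for (x, y), p in pmf.items())

x_values = sorted({x for x, _ in pmf})
f1 = {x: sum(p for (xi, _), p in pmf.items() if xi == x) for x in x_values}

def cond_expect(func, x):
    """E[func(Y) | X = x], using g2(y|x) = f(x, y) / f1(x)."""
    return sum(func(y) * p for (xi, y), p in pmf.items() if xi == x) / f1[x]

cef = {x: cond_expect(lambda y: y, x) for x in x_values}        # mu_{Y|x}
cvf = {x: cond_expect(lambda y: y * y, x) - cef[x] ** 2          # sigma^2_{Y|x}
       for x in x_values}

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
var_y = expect(lambda x, y: y * y) - mean_y ** 2
cov_xy = expect(lambda x, y: x * y) - mean_x * mean_y

# Expectations taken over the marginal distribution of X.
mean_of_cef = sum(cef[x] * f1[x] for x in x_values)
mean_of_cvf = sum(cvf[x] * f1[x] for x in x_values)
var_of_cef = sum(cef[x] ** 2 * f1[x] for x in x_values) - mean_of_cef ** 2
cov_x_cef = sum(x * cef[x] * f1[x] for x in x_values) - mean_x * mean_of_cef

print(mean_y == mean_of_cef)                # T9
print(var_y == mean_of_cvf + var_of_cef)    # T10
print(cov_xy == cov_x_cef)                  # T12
```

Exact rational arithmetic is used so that the equalities can be tested directly rather than up to rounding error.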


5.3. Conditional Expectation Function 


As we have seen, the conditional expectation of Y given that X = x is

$$E(Y|x) = \mu_{Y|x} = \int_{-\infty}^{\infty} y\, g_2(y|x)\,dy.$$

As we change x, that is, allow X to vary, we get $E(Y|X) = \mu_{Y|X}$, a function
of X, known as the conditional expectation function, or CEF, or “population
regression function” of Y given X. Similarly, V(Y|x) is the conditional
variance of Y given that X = x, and V(Y|X) is the conditional variance
function, or CVF, of Y given X. The shapes of the CEF and CVF are 
determined ultimately by the joint pmf or pdf. (Note the confusing 
language: the CEF and CVF of Y given X are, mathematically speaking, 
functions of X.) 


Example. For the roof distribution, the CEF of Y given X is
defined only for $0 \leq x \leq 1$. There

$$\begin{aligned}
E(Y|x) &= \int_0^1 y[(x + y)/(x + 1/2)]\,dy \\
&= [1/(x + 1/2)](xy^2/2 + y^3/3)\Big|_0^1 = [1/(x + 1/2)](x/2 + 1/3) \\
&= (3x + 2)/(6x + 3).
\end{aligned}$$

This function is plotted in Figure 5.1.


The deviation of Y from its CEF has certain characteristic properties.
Let $\varepsilon = Y - E(Y|X)$. Because $\varepsilon$ is just the deviation from a
(conditional) expected value, we have

(5.1) $E(\varepsilon|X) = 0$,

(5.2) $V(\varepsilon|X) = \sigma_{Y|X}^2$.

Figure 5.1 Roof distribution: CEF and BLP.

From these it follows that

(5.3) $E(\varepsilon) = 0$,
(5.4) $C(X, \varepsilon) = 0$,
(5.5) $V(\varepsilon) = E(\sigma_{Y|X}^2)$,
(5.6) If Z = h(X), then $C(Z, \varepsilon) = 0$.

Proofs. By T8 and Eq. (5.1), $E(\varepsilon) = E_X[E(\varepsilon|X)] = E_X(0) = 0$. By T12
and Eq. (5.1), $C(X, \varepsilon) = C[X, E(\varepsilon|X)] = C(X, 0) = 0$. By T10, Eq. (5.2),
and Eq. (5.1), $V(\varepsilon) = E(\sigma_{Y|X}^2) + V(\mu_{\varepsilon|X}) = E(\sigma_{Y|X}^2)$. Finally,

$$C(Z, \varepsilon) = E(Z\varepsilon) - E(Z)E(\varepsilon) = E(Z\varepsilon) = E_X[Z E(\varepsilon|X)] = E(Z \cdot 0) = 0.$$ ∎

We conclude that the deviation of Y from the function E(Y|X) is a
random variable that has zero expectation, and zero covariance with
every function of the conditioning variable X. No other function of X
yields deviations with the latter property.


5.4. Prediction


Recall the univariate prediction problem introduced in Section 3.4: The
random variable Y has known pmf or pdf f(y). A single draw is made
and you are asked to predict the value of Y. The best constant predictor
c, in the sense of minimizing $E(U^2)$ where U = Y - c, is $c = \mu_Y$. For
that optimal choice of c, we have $U = Y - \mu_Y$, with E(U) = 0 and
$E(U^2) = E(Y - \mu_Y)^2 = \sigma_Y^2$.

Now consider this prediction problem for the bivariate case: The
random vector (X, Y) has known joint pmf or pdf f(x, y). A single draw
is made. You are told the value of X that was drawn and asked to predict
the value of Y that accompanied it. You are free to use any function of
X, say h(X), as your predictor. What is the best predictor, in the sense
of minimizing $E(U^2)$ where U = Y - h(X)? The answer is h(X) = E(Y|X).
That is, the best predictor of Y given X is the CEF.


Proof. Let h(X) be any function of X, let U = Y - h(X),
$\varepsilon = Y - E(Y|X)$, and W = E(Y|X) - h(X), so that $U = \varepsilon + W$, with W being a
function of X alone. For a particular X = x, we have W = E(Y|x) -
h(x) = w, say. So at X = x, we have $U = \varepsilon + w$, so
$U^2 = \varepsilon^2 + w^2 + 2w\varepsilon$, so

$$E(U^2|x) = E(\varepsilon^2|x) + w^2 + 2wE(\varepsilon|x) = \sigma_{Y|x}^2 + w^2,$$

using Eqs. (5.1) and (5.2). Across all X,

$$E(U^2) = E_X[E(U^2|X)] = E(\sigma_{Y|X}^2) + E(W^2).$$

The last term is nonnegative and vanishes iff W = 0, that is, iff h(X) =
E(Y|X). ∎


For the optimal choice of h(X), the prediction error is U = Y -
E(Y|X) = $\varepsilon$, with $E(U) = E(\varepsilon) = 0$, $E(U^2) = E(\varepsilon^2) = E(\sigma_{Y|X}^2)$. From T10
we see that E(Y|X) is a better (strictly, no worse) predictor than E(Y).
Both predictors are unbiased, but in general, the additional information 
provided by knowledge of the X-value improves the prediction of Y. 

Continuing with the bivariate setting of the prediction problem, suppose
that we confine the choice of predictors to linear functions of X:
h(X) = a + bX. The best such function, in the sense of minimizing $E(U^2)$,
where U = Y - h(X), is the line

$$E^*(Y|X) = \alpha + \beta X,$$

where

$$\beta = \sigma_{XY}/\sigma_X^2, \quad \alpha = \mu_Y - \beta \mu_X.$$

This line is called the linear projection (or LP) of Y on X, or best linear
predictor (BLP) of Y given X.


Proof. Write U = Y - (a + bX), and use the linearity of the expectation
operator to calculate

$$\partial E(U^2)/\partial a = E(\partial U^2/\partial a) = 2E(U\,\partial U/\partial a) = -2E(U),$$
$$\partial E(U^2)/\partial b = E(\partial U^2/\partial b) = 2E(U\,\partial U/\partial b) = -2E(XU).$$

The first-order conditions are E(U) = 0 and E(XU) = 0, which together
are equivalent to E(U) = 0 and C(X, U) = 0. Substituting for U, we get

E(Y) = a + bE(X),
C(X, Y) = bV(X).

The solution values are denoted as $\alpha$ and $\beta$, and it can be confirmed
that they locate a minimum. ∎


The minimized value of the criterion is

$$E(U^2) = V[Y - (\alpha + \beta X)] = V(Y) + \beta^2 V(X) - 2\beta C(X, Y) = V(Y) - \beta^2 V(X),$$

where the last step uses $C(X, Y) = \beta V(X)$.


Example. For the roof distribution, use the moments previously
obtained to calculate

$$\beta = (-1/144)/(11/144) = -1/11, \quad \alpha = 7/12 - (-1/11)(7/12) = 7/11.$$

So E*(Y|X) = 7/11 - (1/11)X. This BLP is plotted along with the CEF
in Figure 5.1.

The deviation of Y from its BLP has several characteristic properties.
Let $U = Y - E^*(Y|X) = Y - (\alpha + \beta X)$. Then we have

(5.7) $E(U) = 0$,
(5.8) $C(X, U) = 0$,
(5.9) $V(U) = V(Y) - \beta^2 V(X)$.

Proofs. The first-order conditions are equivalent to E(U) = 0 and
C(X, U) = 0. With E(U) = 0, we have $V(U) = E(U^2)$. ∎


We conclude that the deviation of Y from the function E*(Y|X) is a 
random variable that has zero expectation, and zero covariance with 
the conditioning variable X. 
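
The roof-distribution BLP can be recomputed from the moments reported in Section 5.1. The following sketch uses exact fractions and also tabulates the CEF and the BLP side by side at three x-values; the variable and function names are illustrative assumptions.

```python
from fractions import Fraction as F

# Roof-distribution moments as reported in the example of Section 5.1.
mean_x = mean_y = F(7, 12)
var_x = var_y = F(11, 144)
cov_xy = F(-1, 144)

beta = cov_xy / var_x                 # slope: C(X, Y) / V(X)
alpha = mean_y - beta * mean_x        # intercept: E(Y) - beta * E(X)
var_u = var_y - beta ** 2 * var_x     # Eq. (5.9): V(U) = V(Y) - beta^2 V(X)

def blp(x):
    """Best linear predictor E*(Y|x) = alpha + beta * x."""
    return alpha + beta * x

def cef(x):
    """CEF of the roof distribution, E(Y|x) = (3x + 2)/(6x + 3)."""
    return (3 * x + 2) / (6 * x + 3)

for x in (F(0), F(1, 2), F(1)):
    print(x, cef(x), blp(x))
```

The two functions are close but not identical over the unit interval, which is consistent with a mildly nonlinear CEF.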


5.5. Conditional Expectations and Linear Predictors 


We have just developed two predictors of Y given X: the CEF, which is 
the best predictor, and the BLP, which is the best linear predictor. We 
were already familiar with the marginal expectation E(Y), which is the 
best constant predictor. It will be useful to recapitulate the connections 
among these concepts. 

Because E(Y|X), E*(Y|X), and E(Y) solve successively more con- 
strained minimization problems, it is clear that, as a predictor of Y, the 
BLP is worse (no better) than the CEF, and better (no worse) than the 
marginal expectation. 
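
The ranking of the three predictors can be illustrated on a small discrete example. The sketch below computes the expected squared prediction error for the CEF, the BLP, and the marginal expectation; the joint pmf is hypothetical and chosen only for illustration.

```python
from fractions import Fraction as F

# Hypothetical joint pmf; values chosen only for illustration.
pmf = {
    (0, 0): F(1, 10), (0, 1): F(2, 10),
    (1, 0): F(3, 10), (1, 2): F(1, 10),
    (2, 1): F(1, 10), (2, 2): F(2, 10),
}

def expect(func):
    """E[func(X, Y)] taken in the joint distribution."""
    return sum(func(x, y) * p for (x, y), p in pmf.items())

mean_x, mean_y = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: x * x) - mean_x ** 2
var_y = expect(lambda x, y: y * y) - mean_y ** 2
cov_xy = expect(lambda x, y: x * y) - mean_x * mean_y

# CEF: conditional mean of Y at each x.
f1 = {}
for (x, _), p in pmf.items():
    f1[x] = f1.get(x, 0) + p
cef = {x: sum(y * p for (xi, y), p in pmf.items() if xi == x) / f1[x] for x in f1}

# BLP: slope and intercept as defined in Section 5.4.
beta = cov_xy / var_x
alpha = mean_y - beta * mean_x

def mse(predictor):
    """Expected squared prediction error E[(Y - h(X))^2]."""
    return expect(lambda x, y: (y - predictor(x)) ** 2)

mse_cef = mse(lambda x: cef[x])
mse_blp = mse(lambda x: alpha + beta * x)
mse_mean = mse(lambda x: mean_y)
print(mse_cef, mse_blp, mse_mean)
```

For any joint distribution with V(X) > 0 the three values should be weakly increasing in that order, and the BLP value should equal V(Y) - β²V(X).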

A sharp distinction between the CEF and the BLP refers to prediction
errors. Let $U = Y - E^*(Y|X)$ and $\varepsilon = Y - E(Y|X)$; then U has zero
covariance with X, while $\varepsilon$ has zero covariance with every function of X.

A pair of theorems relates the BLP to the CEF:


T13. LINEAR APPROXIMATION TO CEF. The best linear
approximation to the CEF, in the sense of minimizing $E(W^2)$ where W =
E(Y|X) - (a + bX), is the BLP, namely $E^*(Y|X) = \alpha + \beta X$, with
$\beta = C(X, Y)/V(X)$, $\alpha = E(Y) - \beta E(X)$.


Proof. This is formally the same linear prediction problem as was
solved in Section 5.4, except that W plays the role of U and E(Y|X) plays
the role of Y. So the solution values must be

$$a = E(\mu_{Y|X}) - bE(X), \quad b = C(X, \mu_{Y|X})/V(X).$$

But $E(\mu_{Y|X}) = E(Y)$ by T8, and $C(X, \mu_{Y|X}) = C(X, Y)$ by T12. ∎


T14. LINEAR CEF. If the CEF is linear, then it coincides with the
BLP: if E(Y|X) = a + bX, then $b = C(X, Y)/V(X) = \beta$ and
$a = E(Y) - \beta E(X) = \alpha$.


With this as background, we may be able to clarify the concept of 
linear relation as used in empirical econometrics. One often reads that a 
dependent variable Y is assumed to be a linear function of X plus an 
error (or disturbance). Some care in interpreting such statements is 
needed. Taken by itself, Y = a + bX + U is a vacuous statement. When 
supplemented by E(U) = 0, it amounts only to stating that E(Y) = a + 
bE(X). When further supplemented by C(X, U) = 0, it amounts only to 
announcing that the BLP is being labeled as a + bX. But to say that Y = 
a + bX + U, with E(U|X) = 0 for all X, is to assume something, namely 
that the CEF is linear. 

Finally, referring back to Chapter 1, we see that the CEF is the 
population counterpart of the sample conditional mean function $m_{Y|X}$,
while the BLP may be viewed as the population counterpart of a certain 
smoothed sample line $m^*_{Y|X}$.


Exercises 


5.1 Curved-roof Distribution. Recall the bivariate distribution introduced
in Exercise 4.1, whose joint pdf is

$$f(x, y) = 3(x^2 + y)/11 \quad \text{for } 0 \leq x \leq 2,\ 0 \leq y \leq 1,$$

with f(x, y) = 0 elsewhere. Figures 4.4, 4.5, and 4.6 plotted f(x, y), the
marginal pdf $f_1(x)$, and selected conditional pdf's $g_2(y|x)$.


(a) For $0 \leq x \leq 2$, find the CEF of Y given X.
(b) Calculate E(X), E(Y), $E(X^2)$, $E(Y^2)$, E(XY), V(X), V(Y), C(X, Y).
(c) Find the BLP of Y given X.

Figure 5.2 Curved-roof distribution: CEF and BLP.


(d) Comment on Figure 5.2, where the CEF and BLP are plotted. 

(e) In the figure, the BLP appears closer to the CEF at high, rather 
than low, values of x. Why does that happen? Hint: See Figure 
4.5. 




5.2 For the joint pmf in the table below: 


(a) Find the conditional expectation function E(Y | X). 
(b) Find the best linear predictor E*(Y | X). 
(c) Prepare a table that gives E(Y|x) and E*(Y|x) for x = 1, 2, 3.


          x = 1    x = 2    x = 3

y = 0     0.15     0.10     0.15
y = 1     0.15     0.30     0.15


5.3 Suppose that the random variables Z (= permanent income in
thousands of dollars) and W (= transitory income in thousands of
dollars) have zero covariance, with E(Z) = 42, E(W) = 0, V(Z) = 2500,
V(W) = 500. Further, X (= current income in thousands of dollars) is
determined as X = Z + W.


(a) Calculate E(X), C(Z, X), C(W, X), and V(X). 

(b) Find the BLP of current income given permanent income. 

(c) Predict as best you can the current income of a person whose 
permanent income is z = 54. 

(d) Find the BLP of permanent income given current income. 

(e) Predict as best you can the permanent income of a person whose 
current income is x = 54. 

(f) Comment on the relation between the answers to (c) and (e). 


5.4 For the setup of Exercise 5.3, suppose also that Y (= consumption 
in thousands of dollars) is determined as Y = (6/7)Z + U where U (= 
transitory consumption in thousands of dollars) has E(U) = 0, V(U) =
250, C(Z, U) = 0, C(W, U) = 0. Find E*(Y|Z), the BLP of consumption
given permanent income, and also find E*(Y|X), the BLP of consump- 
tion given current income. Comment on the distinctions between these 
two BLP's. 


5.5 Provide a counterexample to this proposition: If $V_1$, $V_2$, $V_3$ are
three random variables with $V_1 = V_2 + V_3$, then $V_1$ and $V_2$ must have
nonzero covariance. Hint: Use the setup of Exercise 5.3.


5.6 The random variables X and Y are jointly distributed. Let
$\varepsilon = Y - E(Y|X)$ and $U = Y - E^*(Y|X)$, where E(Y|X) is the CEF and E*(Y|X)
is the BLP. Determine whether the following is true or false:
$C(\varepsilon, U) = V(\varepsilon)$.


5.7 Suppose that the criterion for successful prediction were changed 
from minimizing $E(U^2)$ to minimizing E(|U|), where U = Y - h(X).


(a) Show that the best predictor would change from E(Y|X) to 
M(Y|X), the conditional median function of Y given X, defined as the 
curve that gives the conditional medians of Y as a function of X. 

(b) Comment on the attractiveness of the conditional median function
as a description of the relation of Y to X in a bivariate population.


5.8 In a bivariate population, let us define the best proportional predictor
of Y given X as the ray through the origin, $E^{**}(Y|X) = \gamma X$, with $\gamma$ being
the value for c that minimizes $E(U^2)$, where now U = Y - cX.


(a) Show that $\gamma = E(XY)/E(X^2)$.

(b) Is E**(Y|X) an unbiased predictor? Explain.

(c) Let $U = Y - \gamma X$. Does C(X, U) = 0?

(d) Find the minimized value of $E(U^2)$.

(e) Compare this $E(U^2)$ with those that result when the marginal
expectation is used, and when the BLP is used.


6 Independence in a Bivariate Distribution 


6.1. Introduction 


We can recognize at least three possible responses to the question, How
is Y related to X in a bivariate population?: the conditional pdf's (or
pmf's) $g_2(y|x)$, the conditional expectation function E(Y|X), and the best
linear predictor E*(Y|X). Correspondingly, we can recognize three
possible responses to the question, What does it mean to say that Y is not
related to X in the bivariate population?


6.2. Stochastic Independence 

In any bivariate probability distribution we can write the joint pdf (or
pmf) as the product of a conditional and a marginal pdf (or pmf):

$$f(x, y) = g_2(y|x) f_1(x) \quad \text{for all } (x, y) \text{ such that } f_1(x) \neq 0.$$

To start, we say that Y is stochastically independent of X iff

$$g_2(y|x) = f_2^*(y) \quad \text{for all } (x, y) \text{ such that } f_1(x) \neq 0,$$

where $f_2^*(y)$ does not depend on x. In other words, the conditional
probability distribution of Y given x is the same for all x-values; that is,
the conditional probability distribution does not vary with ("is
independent of") X. One implication of stochastic independence is immediate:
the marginal pdf of Y is

$$f_2(y) = \int f(x, y)\,dx = \int g_2(y|x) f_1(x)\,dx = f_2^*(y) \int f_1(x)\,dx = f_2^*(y).$$

(Note: In this chapter the symbol $\int$ is used as shorthand for
$\int_{-\infty}^{\infty}$.) That is, if the conditional distribution of Y is the same
for all values of X, then the marginal distribution of Y coincides with
that common conditional distribution. So


Y is stochastically independent of X iff $f(x, y) = f_1(x) f_2(y)$.


That is, Y is stochastically independent of X if and only if the joint pdf
(or pmf) factors into the product of the marginal pdf's (or pmf's). In
that event, the conditional pdf (or pmf) of X given Y is

$$g_1(x|y) = f(x, y)/f_2(y) = f_1(x) f_2(y)/f_2(y) = f_1(x).$$

We conclude that Y is stochastically independent of X if and only if X
is stochastically independent of Y. So stochastic independence is a
symmetric relation, which leads us to the equivalent, more traditional


DEFINITION. X and Y are stochastically independent iff

$$f(x, y) = f_1(x) f_2(y) \quad \text{for all } (x, y).$$


For brevity, the unqualified term independent is often used instead of 
stochastically independent. 


The following implications are straightforward. 


If X and Y are independent, then: 


I1. If A is an event defined in terms of X alone, and B is an event
defined in terms of Y alone, then $\Pr(A \cap B) = \Pr(A)\Pr(B)$; that is, A
and B are independent events.


I2. If Z = h(X) is a function of X alone, then Z and Y are independent
random variables.


Proof. Recall from Section 2.5 how, in a univariate distribution, one
goes from f(x), the pdf or pmf of X, to g(z), the pdf (or pmf) of the
function Z = h(X). Apply the same method here to go from each $g_1(x|y)$
to a $g_1^*(z|y)$. If $g_1(x|y)$ is the same for all y, then $g_1^*(z|y)$ must also
be the same for all y. ∎


I3. Let $Z_1 = h_1(X)$ be a function of X alone, and $Z_2 = h_2(Y)$ be a
function of Y alone. Then $Z_1$ and $Z_2$ are independent.


Proof. By I2, $Z_1$ and Y are independent. Apply I2 again with $Z_1$ taking
the role of Y, and Y taking the role of X. ∎


6.3. Roles of Stochastic Independence 


Stochastic independence will play several roles in this book. 

* Independence serves as a stringent baseline for discussing relations 
among variables. If X and Y are independent random variables, then it 
is certainly appropriate to say that there is "no relation" between them 
in the population. 

* Independence serves as a device for building a joint distribution 
from a pair of marginal distributions. If $X \sim f_1(x)$, $Y \sim f_2(y)$, and X and
Y are independent, then $f(x, y) = f_1(x) f_2(y)$.


Example: Tossing Two Coins. Suppose X ~ Bernoulli(p) and
Y ~ Bernoulli(p), so for all (x, y) with x and y being 0 or 1,

$$f_1(x) = p^x (1 - p)^{1-x}, \quad f_2(y) = p^y (1 - p)^{1-y}.$$


What is the joint distribution of (X, Y)? That is, how does one fill in the 
probabilities of the four possible paired outcomes? Without an assump- 
tion of independence (or some other information), we cannot fill them 
in; with it, we can. 
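
Under the independence assumption, filling in the four cells is mechanical. The sketch below builds the joint pmf as the product of the two Bernoulli marginals; the value p = 1/3 and the helper name are illustrative assumptions, since the text leaves p unspecified.

```python
from fractions import Fraction as F

def bernoulli_pmf(p):
    """Return the Bernoulli(p) pmf as a dict over the outcomes {0, 1}."""
    return {0: 1 - p, 1: p}

p = F(1, 3)  # illustrative value; the text leaves p unspecified
f1 = bernoulli_pmf(p)
f2 = bernoulli_pmf(p)

# Under independence the joint pmf is the product of the marginals.
joint = {(x, y): f1[x] * f2[y] for x in f1 for y in f2}

for point, prob in sorted(joint.items()):
    print(point, prob)
```

Without independence, the same two marginals are compatible with many different joint tables, which is the point of the example.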


A couple of remarks on this example: 

(1) If X and Y are independent, then we will have a random vector 
(X, Y) in which the two random variables are independent and identically 
distributed. We will then say that (X, Y) is a random sample of size 2 from
the Bernoulli(p) population.

(2) “Independent” and “identically distributed” are distinct concepts.
We might have had X ~ Bernoulli(p) and Y ~ Bernoulli(p*), with
$p^* \neq p$, in which case the variables would not be identically distributed. Or
we might have had them both Bernoulli(p) but not independent.

* Independence serves as a device for building up a behavioral model.
For example, suppose Y = 1 if a household purchases a car, Y = 0 if
not, for the population of households in 1990. Suppose further that Y
is determined by X = income and U = taste for cars, according to this
rule:

$$Y = \begin{cases} 1 & \text{if } a + bX + U > 0, \\ 0 & \text{if } a + bX + U \leq 0. \end{cases}$$

Here a and b are constants, X and Y are observable, while U is an
unobserved variable. How does the probability of car purchase vary, if
at all, with income? Suppose that U ~ standard normal, with X and U
independent. The question is, what is Pr(Y = 1|x) at income level x?
Now

$$Y = 1 \iff a + bx + U > 0 \iff U > -(a + bx).$$

So

$$\Pr(Y = 1|x) = \Pr[U > -(a + bx)\,|\,x] = 1 - F[-(a + bx)] = F(a + bx),$$

where F(·) denotes the standard normal cdf, and the last step uses the
symmetry of the standard normal distribution. Here F(a + bX) is not the
cdf of income, but rather the CEF of the binary variable Y, conditional
on the continuous variable X. Observe how the assumption of independence
was used to assert that U|x ~ standard normal for every x.
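
The formula Pr(Y = 1|x) = F(a + bx) can be evaluated with the standard library and compared with a simulation in which U is drawn independently of x. In the sketch below the constants a = -1 and b = 0.05, the income levels, the number of draws, and the seed are all hypothetical choices for illustration.

```python
import math
import random

def std_normal_cdf(z):
    """Standard normal cdf F(z), written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def purchase_probability(x, a, b):
    """Pr(Y = 1 | x) = F(a + b x) under the model in the text."""
    return std_normal_cdf(a + b * x)

def simulated_frequency(x, a, b, draws=100_000, seed=0):
    """Share of draws with a + b x + U > 0, with U drawn independently of x."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(draws) if a + b * x + rng.gauss(0.0, 1.0) > 0)
    return hits / draws

a, b = -1.0, 0.05          # hypothetical constants
for income in (10.0, 20.0, 40.0):
    print(income, purchase_probability(income, a, b),
          simulated_frequency(income, a, b))
```

With b > 0 the purchase probability rises with income, and the simulated frequencies should track the closed-form values up to sampling error.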


6.4. Mean-Independence and Uncorrelatedness 


We turn to less stringent concepts of the absence of a relation between 
the variables in a bivariate distribution. 


Mean-Independence 

In any bivariate probability distribution, we have

$$E(Y|x) = \int y\, g_2(y|x)\,dy.$$

We say that Y is mean-independent of X iff

$$E(Y|x) = \mu_Y^* \quad \text{for all } x \text{ such that } f_1(x) \neq 0,$$

where $\mu_Y^*$ does not depend on x. In other words, the conditional
expectation of Y given x is the same for all x-values; that is, the population
conditional mean does not vary with ("is independent of") X. One
implication of mean-independence is immediate: the marginal
expectation of Y is

$$\mu_Y = E_X[E(Y|X)] = E(\mu_Y^*) = \mu_Y^*,$$

by the Law of Iterated Expectations (T8, Section 5.2). That is, if the
conditional expectation of Y is the same for all values of X, then the
marginal expectation of Y coincides with that common conditional
expectation.

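Another implication is:


M1. If Y is mean-independent of X, and Z = h(X) is a function of X
alone, then Y is mean-independent of Z.


Proof. If h(·) is one-to-one, then the implication is immediate, because
$E(\cdot|z)$ is equivalent to $E(\cdot|x)$ with $x = h^{-1}(z)$. Otherwise, let i(k) denote
the set of all i such that $h(x_i) = z_k$, and let $\sum_{i(k)}$ denote summation over
all i in that set. Then the joint probability mass at the point $(z_k, y_j)$ is
$f^*(z_k, y_j) = \sum_{i(k)} f(x_i, y_j)$, so that the marginal probability mass at $z_k$ is

$$f_1^*(z_k) = \sum_j f^*(z_k, y_j) = \sum_j \Big[\sum_{i(k)} f(x_i, y_j)\Big] = \sum_{i(k)} \Big[\sum_j f(x_i, y_j)\Big] = \sum_{i(k)} f_1(x_i).$$

So the conditional probability mass for $y_j$ given $z_k$ is

$$g_2^*(y_j|z_k) = f^*(z_k, y_j)/f_1^*(z_k) = \Big[\sum_{i(k)} g_2(y_j|x_i) f_1(x_i)\Big] \Big/ f_1^*(z_k) = \sum_{i(k)} w_{ik}\, g_2(y_j|x_i),$$

where $w_{ik} = f_1(x_i)/f_1^*(z_k)$, say, with $\sum_{i(k)} w_{ik} = 1$. Then

$$E(Y|z_k) = \sum_j y_j\, g_2^*(y_j|z_k) = \sum_j y_j \Big[\sum_{i(k)} w_{ik}\, g_2(y_j|x_i)\Big] = \sum_{i(k)} w_{ik} \Big[\sum_j y_j\, g_2(y_j|x_i)\Big] = \sum_{i(k)} w_{ik} E(Y|x_i).$$

If $E(Y|x_i)$ is constant across all i, then $E(Y|z_k)$ will be constant at that
same value across all k. ∎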


What is the connection between independence and mean-indepen- 
dence? If two distributions are the same, they must have the same mean, 
so independence implies mean-independence. But two distributions
may have the same mean, and yet have different variances, or third
moments, or medians. So mean-independence is weaker than stochastic 
independence, because it refers only to the expectation rather than to 
the entire distribution. Nevertheless, to the extent that economists do 
interpret "the relation" between Y and X to refer to the population 
conditional mean function, then it is mean-independence rather than 
stochastic independence that should serve as the natural baseline for 
“no relation." 

Mean-independence is not a symmetric relation: if Y is mean-inde- 
pendent of X, then X may or may not be mean-independent of Y. 


Example: Three-point Distribution. Suppose that (X, Y) is discrete,
with f(x, y) = 1/3 at each of three mass points, namely (1, -1), (0, 0),
and (1, 1). Then

E(Y|x = 1) = 0 = E(Y|x = 0),

but

E(X|y = -1) = 1, E(X|y = 0) = 0, E(X|y = 1) = 1.


Uncorrelatedness 


Recall the definition of the covariance in a bivariate probability
distribution:

$$C(X, Y) = E\{[X - E(X)][Y - E(Y)]\}.$$

We say that Y is uncorrelated with X iff C(X, Y) = 0. Clearly,
uncorrelatedness is a symmetric relation.

What is the connection between uncorrelatedness and mean-independence?
Two results will shed some light:


M2. If Y is mean-independent of X, then Y is uncorrelated with X.


Proof. By T12 (Section 5.2), C(X, Y) = C[X, E(Y|X)]. If $E(Y|X) = \mu_Y$
for all X, then $C(X, Y) = C(X, \mu_Y) = 0$, because $\mu_Y$ is a constant. ∎


M3. If Y is mean-independent of X, and Z = h(X) is a function of X
alone, then Y is uncorrelated with Z.

Proof. By M1, Y is mean-independent of Z. Use M2 with Z playing
the role of X. ∎


Clearly, mean-independence is stronger than uncorrelatedness. The
three-point distribution example illustrates this. With f(x, y) = 1/3 at
each of the three mass points, namely (1, -1), (0, 0), and (1, 1), we
calculate C(X, Y) = 0. But X is not mean-independent of Y in that
example.
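
The three-point distribution is small enough to check by direct enumeration. The sketch below computes both sets of conditional means and the covariance with exact fractions; the helper names and the coordinate-index convention are illustrative assumptions.

```python
from fractions import Fraction as F

# Three-point distribution from the text: mass 1/3 at each point.
pmf = {(1, -1): F(1, 3), (0, 0): F(1, 3), (1, 1): F(1, 3)}

def expect(func):
    """E[func(X, Y)] taken in the joint distribution."""
    return sum(func(x, y) * p for (x, y), p in pmf.items())

def cond_mean(target, given, value):
    """Conditional mean of coordinate `target` given coordinate `given` == value.

    Coordinates are indexed 0 for X and 1 for Y (an illustrative convention).
    """
    mass = sum(p for pt, p in pmf.items() if pt[given] == value)
    return sum(pt[target] * p for pt, p in pmf.items() if pt[given] == value) / mass

e_y_given_x = {x: cond_mean(1, 0, x) for x in (0, 1)}
e_x_given_y = {y: cond_mean(0, 1, y) for y in (-1, 0, 1)}
cov_xy = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

print(e_y_given_x)   # constant in x: Y is mean-independent of X
print(e_x_given_y)   # varies with y: X is not mean-independent of Y
print(cov_xy)        # zero: X and Y are uncorrelated
```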

Indeed, Y can be uncorrelated with many functions of X without 
being mean-independent of X. What is true is that: 


M4. If Y is uncorrelated with E(Y|X), then Y is mean-independent 
of X. 


Proof. Let Z = E(Y|X), so $Y = Z + \varepsilon$ with $C(Z, \varepsilon) = 0$. Then
$C(Y, Z) = C(Z + \varepsilon, Z) = C(Z, Z) + C(Z, \varepsilon) = V(Z) \geq 0$, with equality iff Z is
constant. So if C(Y, Z) = 0, then E(Y|X) is constant, which is
mean-independence. ∎


6.5. Types of Independence


One useful way to distinguish among uncorrelatedness, mean-independence,
and stochastic independence is in terms of the joint moments of
the probability distribution, namely the $E(X^r Y^s)$, where r and s are
positive integers. We see that:

If Y is uncorrelated with X, then E(XY) = E(X)E(Y).

Proof. C(X, Y) = E(XY) - E(X)E(Y). ∎

If Y is mean-independent of X, then $E(X^r Y) = E(X^r)E(Y)$ for all r.

Proof. Let $Z = X^r = h(X)$. Then by M3, C(Z, Y) = 0. ∎

If Y is independent of X, then $E(X^r Y^s) = E(X^r)E(Y^s)$ for all r, s.

Proof. By I3, $X^r$ and $Y^s$ are independent; hence they are
uncorrelated. ∎

Another informative distinction between uncorrelatedness and
mean-independence concerns prediction:

If Y is mean-independent of X, then E(Y|X) = E(Y) for all X, the CEF 
of Y given X is a horizontal line, and the best predictor of Y given X is 
E(Y). If Y is uncorrelated with X, then E*(Y|X) = E(Y) for all X, the 
BLP of Y given X is a horizontal line, and the best linear predictor of Y 
given X is E(Y). 

Recalling the discussion of deviations in Sections 5.3 and 5.4, we may
now say that:


If $\varepsilon = Y - E(Y|X)$, then $\varepsilon$ is mean-independent of X.
If $U = Y - E^*(Y|X)$, then U is uncorrelated with X.

In applied econometrics, one sometimes reads that "X and Y are 
uncorrelated, so there is no relation between them." This statement is 
ambiguous. Only if "the relation" of Y to X refers to the BLP rather 
than to the CEF would such a statement be appropriate. One also reads 
that "there is no linear relation" between Y and X. That statement too 
is ambiguous, and should not be interpreted to say that "there is a 
nonlinear relation" between Y and X. Properly speaking, "no linear 
relation" means that C(X, Y) = 0, so the best-fitting linear relation 
between Y and X is a horizontal line. And that leaves two possibilities 
open: perhaps the CEF is also a horizontal line (Y is mean-independent 
of X), or perhaps the CEF is a curve, the best linear approximation to 
which happens to be a horizontal line. 


Example. For the three-point distribution introduced in Section 
6.4, the CEF E(X|Y) is V-shaped (nonlinear), while the BLP E*(X|Y) is 
horizontal because C(X, Y) = 0. 


Example. Let Y = earnings and X = age. Because of natural 
life-cycle development, it is plausible that E(Y|X) plots as an inverted 
U. If so, it is quite possible that E*(Y|X) is horizontal. 
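
The gap between a horizontal BLP and a horizontal CEF can be checked on a small discrete case. The sketch below uses a hypothetical population, not the three-point distribution of Section 6.4: X takes the values -1, 0, 1 with equal probability and Y = X^2. The function names are illustrative only.

```python
from fractions import Fraction

# Hypothetical population: X uniform on {-1, 0, 1}, Y = X^2 (an assumption
# made for illustration; it is not a distribution taken from the text).
pmf = {(x, x * x): Fraction(1, 3) for x in (-1, 0, 1)}


def expect(g):
    """Expectation of g(x, y) under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())


def covariance():
    """C(X, Y) = E(XY) - E(X)E(Y)."""
    return expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)


def cef(x0):
    """E(Y | X = x0), computed from the joint pmf."""
    px = sum(p for (x, _), p in pmf.items() if x == x0)
    return sum(p * y for (x, y), p in pmf.items() if x == x0) / px


# Zero covariance: the BLP of Y given X is horizontal.
print("C(X, Y) =", covariance())
# The CEF varies with x: Y is not mean-independent of X.
print("CEF:", {x0: cef(x0) for x0 in (-1, 0, 1)})
```

Here the covariance is zero although the CEF is V-shaped, which is the second of the two possibilities left open by "no linear relation."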


6.6. Strength of a Relation 


In some contexts, it is interesting to measure the extent of dependence 
between Y and X in a bivariate population. It seems natural to base such 
measures on the analysis of prediction.

Recalling the definition of the correlation coefficient (Section 5.1), we
show:


CAUCHY-SCHWARZ INEQUALITY.


If $\rho = C(X, Y)/[\sqrt{V(X)}\sqrt{V(Y)}]$, then $0 \le \rho^2 \le 1$.


Proof. From $\beta = C(X, Y)/V(X)$ and Eq. (5.9), it follows that
$\rho^2 = C^2(X, Y)/[V(X)V(Y)] = \beta^2 V(X)/V(Y) = [V(Y) - V(U)]/V(Y)$.
So $\rho^2 = 1 - V(U)/V(Y)$. But $0 \le V(U) \le V(Y)$. ■


If $\rho^2 = 1$, we say that X and Y are perfectly correlated. This $\rho^2$, the
population coefficient of determination, measures the proportional
reduction in expected squared prediction error that is attributable to using
the BLP E*(Y|X) rather than the marginal expectation E(Y) for
predicting Y given X. It is commonly used as an indicator of the strength
of "the linear relation" between Y and X in a population.


Example. For the roof distribution,
$\rho^2 = (-1/144)^2/[(11/144)(11/144)] = 1/121$.


A related measure relies on the CEF rather than on the BLP. Referring
to the Analysis of Variance formula (T10, Section 5.2), define the
correlation ratio for Y on X as

$$\eta^2 = V_X[E(Y|X)]/V(Y) = 1 - E_X[V(Y|X)]/V(Y).$$

Clearly $0 \le \eta^2 \le 1$. This $\eta^2$ measures the proportional reduction in
expected squared prediction error that is attributable to using the CEF
E(Y|X) rather than the marginal expectation E(Y) for predicting Y given
X. It may be used as an indicator of the strength of the relation of Y to
X, when "the relation" is interpreted to be the CEF. Because the BLP
solves a constrained version of the prediction problem solved by the
CEF, it follows that $\rho^2 \le \eta^2$, with equality iff the CEF is linear.
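
The two measures of strength can be compared exactly on a small example. The sketch below assumes a hypothetical joint pmf (X uniform on {0, 1, 2}; given X = x, Y equals x^2 - 1 or x^2 + 1 with equal probability), so that the CEF is the curve x^2. Function names are illustrative.

```python
from fractions import Fraction

# Hypothetical joint pmf (an illustrative assumption): X is uniform on
# {0, 1, 2}, and given X = x, Y equals x^2 - 1 or x^2 + 1 with equal odds.
pmf = {}
for x in (0, 1, 2):
    for y in (x * x - 1, x * x + 1):
        pmf[(x, y)] = Fraction(1, 6)


def expect(g):
    """Expectation of g(x, y) under the joint pmf."""
    return sum(p * g(x, y) for (x, y), p in pmf.items())


def rho_squared():
    """Population coefficient of determination, C^2(X, Y)/[V(X)V(Y)]."""
    ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
    vx = expect(lambda x, y: (x - ex) ** 2)
    vy = expect(lambda x, y: (y - ey) ** 2)
    cxy = expect(lambda x, y: (x - ex) * (y - ey))
    return cxy ** 2 / (vx * vy)


def eta_squared():
    """Correlation ratio, V[E(Y|X)]/V(Y)."""
    ey = expect(lambda x, y: y)
    vy = expect(lambda x, y: (y - ey) ** 2)
    var_cef = Fraction(0)
    for x0 in {x for (x, _) in pmf}:
        px = sum(p for (x, _), p in pmf.items() if x == x0)
        cef = sum(p * y for (x, y), p in pmf.items() if x == x0) / px
        var_cef += px * (cef - ey) ** 2
    return var_cef / vy


print(rho_squared(), eta_squared())  # 24/35 26/35
```

Because the CEF is nonlinear in this example, the inequality between the two measures is strict.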

One should not confuse either of these measures of strength with
measures of steepness such as the slope of the BLP,
$\partial E^*(Y|X)/\partial X = \beta$, or the slope of the CEF,
$\partial E(Y|X)/\partial X$. In most economic contexts, slope, rather than
strength, will be of primary interest.


Exercises


6.1 Suppose that $X_1$ and $X_2$ are stochastically independent Bernoulli
variables, with parameters $p_1$ and $p_2$ respectively. Let $Y = X_1 X_2$ and
$W = X_1 + X_2$. Determine whether each of the following statements is true
or false.

(a) The random variable Y is distributed Bernoulli with parameter
$p_1 p_2$.
(b) The expectation of $W^2$ is equal to $p_1^2 + p_2^2$.


6.2 Two economists know the joint distribution of X = price and Y = 
quantity. One decides to predict quantity given price, using the BLP 
E*(Y|X); his prediction error is U = Y — E*(Y|X). The other decides to 
predict price given quantity; her prediction error is V = X — E*(X|Y). 
Let $\sigma_{XY} = C(X, Y)$, $\sigma_{UV} = C(U, V)$, and $\rho$ = correlation of X and Y.

(a) Show that $\sigma_{UV} = -(1 - \rho^2)\sigma_{XY}$.
(b) What does that result imply about the sign and magnitude of $\sigma_{UV}$
as compared with the sign and magnitude of $\sigma_{XY}$?


6.3 Suppose that Z = XY, where X and Y are independent random
variables. Show that $V(Z) = V(X)V(Y) + E^2(X)V(Y) + E^2(Y)V(X)$.


6.4 Suppose that Z = XY, where Y is mean-independent of X and the 
conditional variance of Y given X is constant. Show that the conclusion 
in Exercise 6.3 is still correct. 


6.5 Suppose that Y = Z — X is independent of Z and of X. Show that 
Y is a constant. 


6.6 Suppose that Y = Z/X is independent of Z and of X, where X and
Z are positive random variables. Show that Y is a constant.


6.7 Suppose that X and W are independent random variables with
$E(X) = 0$, $E(X^2) = 1$, $E(X^3) = 0$, $E(W) = 1$, $E(W^2) = 2$. Let Y = W +
WX.

(a) Find the CEF E(Y|X) and the BLP E*(Y|X).
(b) Change the assumption $E(X^3) = 0$ to $E(X^3) = 1$. Find the CEF
E(Y|X) and the BLP E*(Y|X).
(c) Which relation remained the same in going from (a) to (b)? Which
changed? Why?


7.1. Univariate Normal Distribution


Recall from Section 2.3 that the random variable Z has the standard
normal distribution iff its pdf is

$$f(z) = \exp(-z^2/2)/\sqrt{2\pi}.$$

It can be shown by integration that

$$E(Z) = 0, \quad E(Z^2) = 1, \quad E(Z^3) = 0, \quad E(Z^4) = 3.$$

Suppose Z ~ standard normal, and let X = a + bZ, where a and b
are constants with b > 0. By the linear function rule (T1, Section 3.3),

$$E(X) = a + bE(Z) = a, \quad V(X) = b^2 V(Z) = b^2,$$

so we may as well rewrite the linear function as

$$X = \mu + \sigma Z, \quad \text{with } \sigma > 0.$$

As in Section 2.5, to find the pdf of X, first find its cdf:

$$G(x) = \Pr(X \le x) = \Pr(\mu + \sigma Z \le x) = \Pr[Z \le (x - \mu)/\sigma] = F(z),$$

with $z = (x - \mu)/\sigma$; here F(·) denotes the standard normal cdf. So the
pdf of X is

$$g(x) = \partial G(x)/\partial x = \partial F(z)/\partial x = [\partial F(z)/\partial z](\partial z/\partial x) = f(z)/\sigma,$$

with $z = (x - \mu)/\sigma$. That is,

$$g(x) = \sigma^{-1}(2\pi)^{-1/2}\exp(-z^2/2) = \exp\{-[(x - \mu)/\sigma]^2/2\}/\sqrt{2\pi\sigma^2}.$$

We write this as $X \sim N(\mu, \sigma^2)$. This is a two-parameter family, the
(general) univariate normal distribution with parameters $\mu$ and $\sigma^2$. The
standard normal distribution is the special case X ~ N(0, 1). Observe
that $E(X) = \mu$ and $V(X) = \sigma^2$, as the notation suggested.

What we have shown is that if Z ~ N(0, 1) and $X = \mu + \sigma Z$ with
$\sigma > 0$, then $X \sim N(\mu, \sigma^2)$. But, as can be verified, the argument reverses:
if $X \sim N(\mu, \sigma^2)$ and $Z = (X - \mu)/\sigma$, then Z ~ N(0, 1). The conclusion is
that

$$X \sim N(\mu, \sigma^2) \quad \text{iff} \quad (X - \mu)/\sigma \sim N(0, 1).$$

It follows that there is no need to tabulate cdf's for general univariate
normal distributions: the N(0, 1) cdf table suffices to provide the
probabilities of events for any $N(\mu, \sigma^2)$ distribution. Just translate the event
in terms of X into an event in terms of Z. (Remark: For b < 0, use the
fact that -Z ~ N(0, 1).)
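
The translation from an event about X to an event about Z is mechanical, and a short sketch may make it concrete. The code below stands in for the N(0, 1) table by computing the standard normal cdf from the error function; the function names and the numbers in the example are hypothetical.

```python
import math


def std_normal_cdf(z):
    """Standard normal cdf F(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_cdf(x, mu, sigma2):
    """Pr(X <= x) for X ~ N(mu, sigma2), obtained by standardizing."""
    z = (x - mu) / math.sqrt(sigma2)
    return std_normal_cdf(z)


# Hypothetical numbers: X ~ N(10, 4), event X <= 12, so z = (12 - 10)/2 = 1.
print(round(normal_cdf(12.0, 10.0, 4.0), 3))  # 0.841
```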

An immediate implication is that a (nontrivial) linear function of a 
normal variable is itself normal. That is: 


If $X \sim N(\mu, \sigma^2)$ and Y = a + bX with $b \ne 0$, then $Y \sim N(a + b\mu, b^2\sigma^2)$.


Proof. It suffices to show that Y = a* + b*Z where Z ~ N(0, 1). Let
$Z = (X - \mu)/\sigma$, so $X = \mu + \sigma Z$. Then
$Y = a + bX = a + b(\mu + \sigma Z) = (a + b\mu) + b\sigma Z = a^* + b^*Z$. ■


The trivial case b = 0 is ruled out because it would make Y be a constant.
Some writers allow b = 0, and would say that Y = a has a "degenerate
normal distribution."


7.2. Standard Bivariate Normal Distribution 


Suppose that $U_1$, $U_2$ are independent N(0, 1) variables. Let $\rho$ be any
constant with $|\rho| < 1$, and let

$$Z_1 = U_1, \quad Z_2 = \rho U_1 + \sqrt{1 - \rho^2}\,U_2.$$

We show that the joint pdf of $(Z_1, Z_2)$ is

(7.1)   $g(z_1, z_2) = (2\pi)^{-1}(1 - \rho^2)^{-1/2}\exp(-w/2),$

with

(7.2)   $w = (z_1^2 + z_2^2 - 2\rho z_1 z_2)/(1 - \rho^2).$

Proof. The joint pdf will be the product of conditional and marginal
pdf's: $g(z_1, z_2) = g_2(z_2|z_1) f_1(z_1)$. Clearly $Z_1 \sim N(0, 1)$ so

$$f_1(z_1) = f(u_1),$$

with $u_1 = z_1$, where f(·) denotes the N(0, 1) pdf. Next, for given
$Z_1 = z_1$ we see that

$$Z_2 = \rho z_1 + \sqrt{1 - \rho^2}\,U_2$$

is a linear function of $U_2$, with $U_2 \sim N(0, 1)$ independently of $Z_1$. So
by the linear-function result in Section 7.1, we know that $Z_2|z_1$ is
normally distributed, with $E(Z_2|z_1) = \rho z_1$ and $V(Z_2|z_1) = 1 - \rho^2$. That is,
$Z_2|z_1 \sim N(\rho z_1, 1 - \rho^2)$. So the conditional pdf is

$$g_2(z_2|z_1) = f(u_2)/\sqrt{1 - \rho^2},$$

with $u_2 = (z_2 - \rho z_1)/\sqrt{1 - \rho^2}$. So the joint pdf is

$$g(z_1, z_2) = f(u_1) f(u_2)/\sqrt{1 - \rho^2} = (2\pi)^{-1}(1 - \rho^2)^{-1/2}\exp[-(u_1^2 + u_2^2)/2].$$

But

$$u_1^2 + u_2^2 = z_1^2 + (z_2 - \rho z_1)^2/(1 - \rho^2)$$
$$= [(1 - \rho^2)z_1^2 + z_2^2 + \rho^2 z_1^2 - 2\rho z_1 z_2]/(1 - \rho^2)$$
$$= (z_1^2 + z_2^2 - 2\rho z_1 z_2)/(1 - \rho^2). \;■$$


The pdf in Eqs. (7.1)-(7.2) defines a one-parameter family, the
standard bivariate normal, or SBVN, distribution with parameter $\rho$. We write
this as $(Z_1, Z_2) \sim SBVN(\rho)$. Figures 7.1, 7.2, and 7.3 plot the $SBVN(\rho)$
distribution for three values of $\rho$; the variables are labeled x and y.

Figure 7.1 SBVN distribution: joint pdf, $\rho = 0$.

Figure 7.2 SBVN distribution: joint pdf, $\rho = 0.6$.

Figure 7.3 SBVN distribution: joint pdf, $\rho = -0.6$.

It is easy to verify that the derivation reverses, so that

$$(Z_1, Z_2) \sim SBVN(\rho) \quad \text{iff} \quad U_1, U_2 \text{ are independent } N(0, 1) \text{ variables},$$

where $U_1 = Z_1$, $U_2 = (Z_2 - \rho Z_1)/\sqrt{1 - \rho^2}$. Consequently, we can deduce
properties of the $SBVN(\rho)$ distribution directly by relying on the
representation in terms of $U_1$, $U_2$. If $(Z_1, Z_2) \sim SBVN(\rho)$, then

$$E(Z_1) = E(U_1) = 0, \quad V(Z_1) = V(U_1) = 1,$$
$$C(Z_1, Z_2) = C(U_1, \rho U_1 + \sqrt{1 - \rho^2}\,U_2) = \rho C(U_1, U_1) + \sqrt{1 - \rho^2}\,C(U_1, U_2) = \rho.$$

So, as the notation suggested, $\rho$ is the correlation of $Z_1$ and $Z_2$. Further,

$$Z_1 \sim N(0, 1), \quad Z_2|Z_1 \sim N(\rho Z_1, 1 - \rho^2),$$

and

$$\rho = 0 \;\Leftrightarrow\; Z_2|Z_1 \sim N(0, 1) \text{ for all } Z_1 \;\Leftrightarrow\; Z_1 \text{ and } Z_2 \text{ are independent}.$$

Of course the roles of $Z_1$ and $Z_2$ can be reversed.

So in the SBVN distribution, the marginals are normal, the
conditionals are normal, the conditional expectation functions are linear, the
conditional variance functions are constant, and uncorrelatedness is
equivalent to stochastic independence. Figure 7.4 plots selected contours
of the SBVN(0.6) distribution along with the two CEF's; the variables
are labeled X and Y.

Figure 7.4 SBVN distribution: contours and CEF's, $\rho = 0.6$.

We can calculate the probabilities of various events in an SBVN
distribution by translating into a standard normal event.


Example. Suppose (X, Y) ~ SBVN(0.6). Find $\Pr(Y \le 2|x = 2)$.
We know that $U = (Y - \rho x)/\sqrt{1 - \rho^2} \sim N(0, 1)$, so at x = 2, we have
U = (Y - 1.2)/0.8, and $Y \le 2 \Leftrightarrow U \le 1$. So $\Pr(Y \le 2|x = 2) = F(1) =$
0.841.
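
The same translation can be written as a small function. This is a sketch only: it relies on the conditional distribution of one SBVN component given the other, normal with mean $\rho x$ and variance $1 - \rho^2$, and it uses the error function in place of the standard normal table. The function names are hypothetical.

```python
import math


def std_normal_cdf(z):
    """Standard normal cdf F(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sbvn_conditional_cdf(y, x, rho):
    """Pr(Y <= y | X = x) when (X, Y) ~ SBVN(rho): Y|x ~ N(rho*x, 1 - rho^2)."""
    u = (y - rho * x) / math.sqrt(1.0 - rho ** 2)
    return std_normal_cdf(u)


# The example in the text: rho = 0.6, x = 2, event Y <= 2.
print(round(sbvn_conditional_cdf(2.0, 2.0, 0.6), 3))  # 0.841
```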


7.3. Bivariate Normal Distribution 


Suppose that $(Z_1, Z_2) \sim SBVN(\rho)$. Let $\mu_1$, $\mu_2$, $\sigma_1 > 0$, $\sigma_2 > 0$ be constants,
and let

$$X_1 = \mu_1 + \sigma_1 Z_1, \quad X_2 = \mu_2 + \sigma_2 Z_2.$$

We show that the joint pdf of $X_1$, $X_2$ is

(7.3)   $\phi(x_1, x_2) = \exp(-w/2)/(2\pi\psi),$

with

(7.4)   $w = (z_1^2 + z_2^2 - 2\rho z_1 z_2)/(1 - \rho^2),$
        $z_1 = (x_1 - \mu_1)/\sigma_1,$
        $z_2 = (x_2 - \mu_2)/\sigma_2,$
        $\psi^2 = \sigma_1^2\sigma_2^2(1 - \rho^2).$


Proof. The joint pdf will be the product of conditional and marginal
pdf's: $\phi(x_1, x_2) = \phi_1(x_1) h_2(x_2|x_1)$. Now $X_1 = \mu_1 + \sigma_1 Z_1$ where $Z_1 \sim N(0, 1)$,
so

$$\phi_1(x_1) = f(z_1)/\sigma_1,$$

with $z_1 = (x_1 - \mu_1)/\sigma_1$. Turn to $X_2 = \mu_2 + \sigma_2 Z_2$. For given $X_1 = x_1$, that
is, for given $Z_1 = (x_1 - \mu_1)/\sigma_1 = z_1$, we see that

$$Z_2|z_1 \sim N(\rho z_1, 1 - \rho^2),$$

and that $X_2$ is a linear function of $Z_2$. So $X_2|x_1$ ~ normal, with

$$E(X_2|x_1) = \mu_2 + \sigma_2 E(Z_2|z_1) = \mu_2 + \rho\sigma_2 z_1,$$
$$V(X_2|x_1) = \sigma_2^2 V(Z_2|z_1) = \sigma_2^2(1 - \rho^2).$$

That is,

$$X_2|x_1 \sim N(\mu_2 + \rho\sigma_2 z_1, \sigma_2^2(1 - \rho^2)),$$

with $z_1 = (x_1 - \mu_1)/\sigma_1$. So the conditional pdf is

$$h_2(x_2|x_1) = f(u_2)/[\sigma_2\sqrt{1 - \rho^2}],$$

with $u_2 = [x_2 - (\mu_2 + \rho\sigma_2 z_1)]/[\sigma_2\sqrt{1 - \rho^2}]$. So the joint pdf is

$$\phi(x_1, x_2) = \phi_1(x_1) h_2(x_2|x_1) = f(z_1) f(u_2)/[\sigma_1\sigma_2\sqrt{1 - \rho^2}].$$

Multiply out and rearrange. ■


The pdf in Eqs. (7.3)-(7.4) defines a five-parameter family, the
(general) bivariate normal, or BVN, distribution, with parameters $\mu_1$, $\mu_2$, $\sigma_1^2$,
$\sigma_2^2$, and $\sigma_{12} = \rho\sigma_1\sigma_2$. We write this as
$(X_1, X_2) \sim BVN(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \sigma_{12})$. It is easy to verify that the
derivation reverses, so that

$$(X_1, X_2) \sim BVN(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \sigma_{12}) \quad \text{iff} \quad (Z_1, Z_2) \sim SBVN(\rho),$$

where $Z_1 = (X_1 - \mu_1)/\sigma_1$, $Z_2 = (X_2 - \mu_2)/\sigma_2$, $\rho = \sigma_{12}/(\sigma_1\sigma_2)$.
Referring back to Section 7.2, we have an equivalent statement:

$$(X_1, X_2) \sim BVN(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \sigma_{12})$$

iff $U_1$ and $U_2$ are independent N(0, 1) variables,
where

$$U_1 = (X_1 - \mu_1)/\sigma_1,$$
$$U_2 = [X_2 - (\mu_2 + \rho\sigma_2 Z_1)]/[\sigma_2\sqrt{1 - \rho^2}],$$
$$\rho = \sigma_{12}/(\sigma_1\sigma_2).$$


7.4. Properties of Bivariate Normal Distribution 


As a result of the derivation and its reversal, we can directly deduce
properties of the general bivariate normal distribution. The marginal
distribution of $X_1$ is normal: $X_1 \sim N(\mu_1, \sigma_1^2)$. The conditional distribution
of $X_2$ given $x_1$ is normal: $X_2|x_1 \sim N[\mu_2 + \rho\sigma_2 z_1, \sigma_2^2(1 - \rho^2)]$, with
$z_1 = (x_1 - \mu_1)/\sigma_1$.

Let $\sigma_{12} = \rho\sigma_1\sigma_2$, and write

$$\mu_2 + \rho\sigma_2 z_1 = \mu_2 + \rho\sigma_2(x_1 - \mu_1)/\sigma_1$$
$$= [\mu_2 - (\rho\sigma_2/\sigma_1)\mu_1] + (\rho\sigma_2/\sigma_1)x_1$$
$$= \alpha + \beta x_1,$$

where $\alpha = \mu_2 - \beta\mu_1$, $\beta = \rho\sigma_2/\sigma_1 = \sigma_{12}/\sigma_1^2$. Also,

$$\sigma_2^2(1 - \rho^2) = \sigma_2^2[1 - \sigma_{12}^2/(\sigma_1^2\sigma_2^2)] = \sigma_2^2 - \beta^2\sigma_1^2 = \sigma^2,$$

say. So we can write

$$X_2|x_1 \sim N(\alpha + \beta x_1, \sigma^2).$$

As for the moments, we have found that

$$E(X_1) = \mu_1 + \sigma_1 E(Z_1) = \mu_1,$$
$$V(X_1) = \sigma_1^2 V(Z_1) = \sigma_1^2,$$
$$C(X_1, X_2) = \sigma_1\sigma_2 C(Z_1, Z_2) = \sigma_1\sigma_2\rho = \sigma_{12}.$$

And we also have seen that $\sigma_{12} = 0 \Rightarrow \rho = 0 \Rightarrow Z_1$ and $Z_2$ independent
$\Rightarrow X_1$ and $X_2$ independent.
We restate these properties for reference.


If $(X_1, X_2) \sim BVN(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \sigma_{12})$, then:


P1. The expectations, variances, and covariance are:

$$E(X_1) = \mu_1, \quad E(X_2) = \mu_2,$$
$$V(X_1) = \sigma_1^2, \quad V(X_2) = \sigma_2^2, \quad C(X_1, X_2) = \sigma_{12},$$

thus justifying the symbols used for the parameters.


P2. The marginal distribution of $X_1$ is normal:

$$X_1 \sim N(\mu_1, \sigma_1^2).$$


P3. The conditional distribution of $X_2$ given $X_1$ is normal:

$$X_2|X_1 \sim N(\alpha + \beta X_1, \sigma^2),$$

where

$$\beta = \sigma_{12}/\sigma_1^2, \quad \alpha = \mu_2 - \beta\mu_1, \quad \sigma^2 = \sigma_2^2 - \beta^2\sigma_1^2.$$

Observe that the CEF is linear and the conditional variance is constant.


P4. Uncorrelatedness implies stochastic independence: If $\sigma_{12} = 0$,
then $X_1$ and $X_2$ are independent.


Of course, the roles of $X_1$ and $X_2$ can be reversed throughout. We
can also see that


P5. A (nontrivial) pair of linear functions of $X_1$ and $X_2$ is distributed
bivariate normal: If $Y_1 = a_1 + b_1 X_1 + c_1 X_2$, and $Y_2 = a_2 + b_2 X_1 + c_2 X_2$,
where the a's, b's, and c's are constants, with $b_1 c_2 - b_2 c_1 \ne 0$, then
$(Y_1, Y_2) \sim$ BVN. The condition $b_1 c_2 - b_2 c_1 \ne 0$ rules out constancy of
either variable or perfect correlation between the variables. Some writers
drop the condition and refer to "degenerate bivariate normal
distributions."


Proof. It suffices to show that $Y_1$ and $Y_2$ can be expressed as linear
functions of $W_1$ and $W_2$ where $(W_1, W_2) \sim SBVN(\rho)$. ■


We can calculate the probabilities of various events in a BVN distri- 
bution by translating into a standard normal event. 


Example. Suppose (X, Y) ~ BVN(2, 4, 6, 5.5, 3). In order to
find $\Pr(Y \le 2|x = 2)$, calculate $\beta = 3/6 = 0.5$, $\alpha = 4 - 0.5(2) = 3$,
$\sigma^2 = 5.5 - (0.5)^2 6 = 4$. We know that $Z = [Y - (3 + 0.5x)]/\sqrt{4} \sim N(0, 1)$,
so at x = 2, we have Z = (Y - 4)/2, and $Y \le 2 \Leftrightarrow Z \le -1$. We find
$\Pr(Y \le 2|x = 2) = F(-1) = 1 - F(1) = 1 - 0.841 = 0.159$.
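
The steps of this calculation follow property P3 directly, and can be sketched as a pair of functions. The names below are hypothetical, the argument order mirrors the BVN notation (means, variances, covariance), and the error function stands in for the standard normal table.

```python
import math


def std_normal_cdf(z):
    """Standard normal cdf F(z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bvn_conditional(mu1, mu2, s11, s22, s12):
    """Return (alpha, beta, sigma2) with X2|X1 ~ N(alpha + beta*X1, sigma2), as in P3."""
    beta = s12 / s11
    alpha = mu2 - beta * mu1
    sigma2 = s22 - beta ** 2 * s11
    return alpha, beta, sigma2


def bvn_conditional_cdf(c, x1, mu1, mu2, s11, s22, s12):
    """Pr(X2 <= c | X1 = x1) for a BVN pair, by translating to a N(0, 1) event."""
    alpha, beta, sigma2 = bvn_conditional(mu1, mu2, s11, s22, s12)
    z = (c - (alpha + beta * x1)) / math.sqrt(sigma2)
    return std_normal_cdf(z)


# The example in the text: (X, Y) ~ BVN(2, 4, 6, 5.5, 3), event Y <= 2 at x = 2.
print(bvn_conditional(2, 4, 6, 5.5, 3))                         # (3.0, 0.5, 4.0)
print(round(bvn_conditional_cdf(2, 2, 2, 4, 6, 5.5, 3), 3))     # 0.159
```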


7.5. Remarks 


* Because it has linear CEF's, constant conditional variances, normal 
marginals, and normal conditionals, the BVN is convenient for illustra- 
tion of theoretical concepts. There is another reason for our interest: 
the BVN arises as the limiting joint distribution of sample means in 
random sampling from any bivariate distribution (see Chapter 10). 

* There is a lot more to a BVN distribution than marginal normality
of its components. Figure 7.5 plots a non-BVN distribution that has
normal marginals. The joint pdf is:

$$\phi(x, y) = 2zf(x)f(y), \quad \text{where } z = 1 \text{ if } xy > 0, \; z = 0 \text{ if } xy \le 0,$$

and f(·) denotes the N(0, 1) pdf. The joint pdf is nonzero only in the
NE and SW quadrants. The marginal pdf of X is

$$\phi_1(x) = \int_{-\infty}^{\infty} \phi(x, y)\,dy = \int_{-\infty}^{\infty} 2zf(x)f(y)\,dy = 2f(x)\int_{-\infty}^{\infty} zf(y)\,dy.$$

For x > 0, zf(y) = f(y) for y > 0, and zf(y) = 0 for $y \le 0$. So for x > 0,

$$\int_{-\infty}^{\infty} zf(y)\,dy = \int_0^{\infty} f(y)\,dy = 1/2.$$

Similarly for $x \le 0$. So $\phi_1(x) = 2f(x)(1/2) = f(x)$. That is, X ~ N(0, 1). By
symmetry, Y ~ N(0, 1), so both marginals are normal. But the joint
distribution is not BVN, and indeed the CEF's are not linear, and the
conditional pdf's are not normal.

Figure 7.5 Non-BVN distribution with normal marginals.

* In a BVN distribution, uncorrelatedness implies independence. But 
two univariate normal variables may be uncorrelated without being 
independent. After all, their joint distribution may not be bivariate 
normal. 
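
The non-BVN example with normal marginals lends itself to a quick simulation check. The sketch below assumes one particular way of drawing from the pdf $2zf(x)f(y)$: draw two independent N(0, 1) values and force the sign of the second to match the sign of the first. The function names, seed, and number of draws are arbitrary choices.

```python
import random


def draw_quadrant_pair(rng):
    """One draw from the pdf 2*z*f(x)*f(y): independent N(0, 1) draws,
    with the sign of the second forced to match the sign of the first."""
    x = rng.gauss(0.0, 1.0)
    y = abs(rng.gauss(0.0, 1.0))
    return (x, y) if x > 0 else (x, -y)


def simulate(n_draws=200_000, seed=0):
    """Return the mean and variance of Y, the share of draws with Y <= 1,
    and the average of Y among draws with X > 0."""
    rng = random.Random(seed)
    draws = [draw_quadrant_pair(rng) for _ in range(n_draws)]
    mean_y = sum(y for _, y in draws) / n_draws
    var_y = sum((y - mean_y) ** 2 for _, y in draws) / n_draws
    share = sum(1 for _, y in draws if y <= 1.0) / n_draws
    pos = [y for x, y in draws if x > 0]
    cef_pos = sum(pos) / len(pos)
    return mean_y, var_y, share, cef_pos


mean_y, var_y, share, cef_pos = simulate()
# Marginal of Y looks like N(0, 1): mean near 0, variance near 1, share near 0.841.
print(mean_y, var_y, share)
# E(Y | X > 0) is near sqrt(2/pi), about 0.80, for every positive x:
# the CEF is a step function, not a line.
print(cef_pos)
```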


Exercises 


7.1 The random variable X is distributed N(3, 16). Calculate each of 
the following. 


(a) Pr(X < 7).  (b) Pr(X > 5).  (c) Pr(X = 3).
(d) Pr(-1 < X < 11).  (e) Pr(X = 0).  (f) $\Pr(|X| \le 3)$.


7.2 The random variable X is distributed N(3, 16). Let Y = 3 - X/4.
Calculate Pr(3.25 < Y < 4.25), and Pr(Y > X).


7.3 The pair of random variables X and Y is bivariate-normally
distributed with parameters $\mu_X = 3$, $\mu_Y = 4$, $\sigma_X^2 = 9$, $\sigma_Y^2 = 20$, and
$\sigma_{XY} = 6$. Calculate each of the following.


(a) E(Y|x = 3).  (b) E(Y|x = 6).  (c) V(Y|x = 3).
(d) V(Y|x = 6).  (e) Pr(Y < 8|x = 3).  (f) $\Pr(Y \le 8|x = 6)$.
(g) $\Pr(Y \le 8)$.


7.4 For the setup of Exercise 7.3, let U = Y + X. Calculate E(U), V(U), 
and Pr(U = 3). 


7.5 Suppose that Z ~ N(42, 2500), W ~ N(0, 500), that Z and W are
independent, and that X = Z + W. Calculate the conditional expectation
function E(Z|X). How do you know that the CEF is linear?


7.6 Suppose that $Y_1 = X + Z_1$, $Y_2 = X + Z_2$, where X, $Z_1$, $Z_2$ are
independent random variables with E(X) = 100, V(X) = 100, $E(Z_1) = 0$,
$V(Z_1) = 20$, $E(Z_2) = 0$, $V(Z_2) = 40$.


(a) Find the best linear predictor of Y, given Yj. 

(b) Find the expected squared prediction error that results when that 
BLP is used. 

(c) Suppose that $Y_1$ and $Y_2$ are bivariate-normally distributed. Could
you improve on your predictor? If so, how? If not, why not?


7.7 Suppose that X = wage income and U = nonwage income are 
bivariate-normally distributed with E(X) = 30, E(U) = 5, V(X) = 185, 
V(U) = 35, C(X, U) = 15. Also, total income is Y = X + U. All variables 
are measured in thousands of dollars. A person reports her total income
to be y = 20. Calculate the probability that her wage income is less than
20.


8.1. Random Sample


We have been discussing probability distributions, that is, populations. 
We now turn to samples, which are sets of observations drawn from a 
probability distribution. We continue to treat the population probability 
distribution as known, deferring until Chapter 11 the practical problem 
that really concerns us, namely how to use a single sample to estimate 
features of an unknown population. 

We start with a univariate probability distribution for the random
variable X. Let $X_1, \ldots, X_n$ denote independent drawings from that
population. That is, they are the random variables that are the outcomes
when the same experiment is repeated n times independently. Then
the random vector $\mathbf{X} = (X_1, \ldots, X_n)'$ is called a random sample of size
n on the variable X, or from the population of X, or from the probability
distribution of X. The values that $\mathbf{X}$ takes on will be denoted as
$\mathbf{x} = (x_1, \ldots, x_n)'$. (Note: Boldface type is used to denote vectors, which are
generally to be thought of as column vectors, and the prime, ', is used
to denote transposition.) The concept of (stochastic) independence,
introduced in Section 6.2 for the bivariate case, is now being applied to
the multivariate case: a set of random variables is independent iff their
joint pdf (or pmf) factors into the product of their marginal pdf's (or
pmf's).

Strictly speaking, the term "random sample" refers to the random 
vector X, but in common usage a single draw, x, on that vector is also 
called a random sample. Observe that in the term “random sample,” 
the adjective “random” has a much stronger meaning than it did in the 
term “random variable.”

If $\mathbf{X} = (X_1, \ldots, X_n)'$ is a random sample on X, then the $X_i$'s are
independent and identically distributed. If f(x) is the pmf or pdf of X, then
the joint pmf or pdf for the random sample $\mathbf{X}$ is

$$g_n(\mathbf{x}) = g_n(x_1, x_2, \ldots, x_n) = f_1(x_1) f_2(x_2) \cdots f_n(x_n) = f(x_1) f(x_2) \cdots f(x_n) = \prod_i f(x_i),$$

using $f_i(\cdot) = f(\cdot)$ for all i, and independence across i. (Note: The symbol
$\prod_i$ is shorthand for $\prod_{i=1}^n$.)
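
The product rule for the joint pmf or pdf of a random sample is easy to express in code. The following is a minimal sketch: `joint_density` is a hypothetical helper, the Bernoulli pmf and exponential pdf are the standard ones, and the sample values and parameters are made up for illustration.

```python
import math


def joint_density(sample, f):
    """g_n(x) = product of f(x_i) over the sample, for a random sample."""
    return math.prod(f(x) for x in sample)


def bernoulli_pmf(p):
    """pmf of a Bernoulli(p) variable: p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return lambda x: p ** x * (1 - p) ** (1 - x) if x in (0, 1) else 0.0


def exponential_pdf(lam):
    """pdf of an exponential(lam) variable: lam * exp(-lam * x) for x > 0."""
    return lambda x: lam * math.exp(-lam * x) if x > 0 else 0.0


# Hypothetical samples and parameter values, chosen only for illustration.
x_bern = [1, 0, 1, 1]
print(joint_density(x_bern, bernoulli_pmf(0.3)))    # equals 0.3**3 * 0.7
x_exp = [0.5, 1.2, 0.3]
print(joint_density(x_exp, exponential_pdf(2.0)))   # equals 2**3 * exp(-2 * 2.0)
```

In both cases the joint density depends on the sample only through n and the sum of the observations.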


Here are some examples of the joint pmf or pdf for random samples.
(1) Bernoulli(p). The pmf of the random variable X is

$$f(x) = p^x(1 - p)^{1 - x} \quad \text{for } x = 0, 1,$$

with f(x) = 0 elsewhere. Then for any n × 1 vector $\mathbf{x}$ whose elements
are all either 0's or 1's,

$$g_n(\mathbf{x}) = \prod_i [p^{x_i}(1 - p)^{1 - x_i}] = \Big[\prod_i p^{x_i}\Big]\Big[\prod_i (1 - p)^{1 - x_i}\Big]$$
$$= p^{\sum_i x_i}(1 - p)^{\sum_i (1 - x_i)}$$
$$= p^{\sum_i x_i}(1 - p)^{n - \sum_i x_i},$$

where $\sum_i$ is shorthand for $\sum_{i=1}^n$. For any other $\mathbf{x}$, $g_n(\mathbf{x}) = 0$.
(2) Normal($\mu$, $\sigma^2$). The pdf of the random variable X is

$$f(x) = (2\pi\sigma^2)^{-1/2}\exp\{-[(x - \mu)/\sigma]^2/2\}.$$

So for any n × 1 vector $\mathbf{x}$,

$$g_n(\mathbf{x}) = (2\pi\sigma^2)^{-n/2}\exp\Big\{-\sum_i [(x_i - \mu)/\sigma]^2/2\Big\}.$$

(3) Exponential($\lambda$). The pdf of the random variable X is

$$f(x) = \lambda\exp(-\lambda x) \quad \text{for } x > 0,$$

with f(x) = 0 elsewhere. So for any n × 1 vector $\mathbf{x}$ whose elements are
all positive,

$$g_n(\mathbf{x}) = \lambda^n\exp\Big(-\lambda\sum_i x_i\Big).$$

For any other $\mathbf{x}$, $g_n(\mathbf{x}) = 0$.


8.2. Sample Statistics


Let $T = h(X_1, \ldots, X_n) = h(\mathbf{X})$ be a scalar function of the random
sample. Then T is called a sample statistic. The values that $T = h(\mathbf{X})$ takes
on will be denoted as $t = h(\mathbf{x})$. Sample statistics include the sample mean,

$$\bar{X} = (X_1 + \cdots + X_n)/n = (1/n)\sum_i X_i,$$

the sample variance,

$$S^2 = (1/n)\sum_i (X_i - \bar{X})^2,$$

the sample raw moments (for nonnegative integers r),

$$M_r' = (1/n)\sum_i X_i^r,$$

and the sample moments about the sample mean (for nonnegative
integers r),

$$M_r = (1/n)\sum_i (X_i - \bar{X})^r.$$

The formulas used here are convenient when the observed data come
in a list, the elements of which need not be distinct. If the data come in
the form of a frequency distribution, one may use equivalent formulas
as in Chapter 1. (Caution: Here i runs from 1 to n, the number of
observations; in Chapter 1, i ran from 1 to I, the number of distinct
values.)

Other sample statistics include the sample maximum, and the sample
proportion having X less than or equal to some specified value c.

Any sample statistic $T = h(\mathbf{X})$ is a random variable, because its value
is determined by the outcome of an experiment. In random sampling,
the probability distribution of T, known as its sampling distribution, is
completely determined by h(·), f(x), and n.

Evidently it is possible to derive the sampling distribution of $T = h(\mathbf{X})$
from knowledge of f(x) and n. As an illustration, consider the sample
mean in random sampling of size 2 from a continuous distribution
whose pdf and cdf are f(x) and F(x). The cdf of $T = (X_1 + X_2)/2$ is

$$G(t) = \Pr(T \le t) = \Pr(X_1 + X_2 \le 2t) = \Pr(X_2 \le 2t - X_1)$$
$$= \int_{-\infty}^{\infty}\Big[\int_{-\infty}^{2t - x_1} f(x_2)\,dx_2\Big] f(x_1)\,dx_1$$
$$= \int_{-\infty}^{\infty} f(x_1)F(2t - x_1)\,dx_1 = \int_{-\infty}^{\infty} F(2t - x)f(x)\,dx,$$

say. So the pdf of T is $g(t) = \partial G(t)/\partial t = 2\int_{-\infty}^{\infty} f(2t - x)f(x)\,dx$.


Example. Suppose that X ~ exponential($\lambda$), so that its pdf is
$f(x) = \lambda\exp(-\lambda x)$ for x > 0 and f(x) = 0 elsewhere. Then for t > 0,

$$g(t) = 2\int_0^{2t} \lambda\exp[-\lambda(2t - x)]\,\lambda\exp(-\lambda x)\,dx$$
$$= 2\lambda^2\exp(-2\lambda t)\int_0^{2t} dx$$
$$= 4\lambda^2 t\exp(-2\lambda t),$$

with g(t) = 0 elsewhere. The integration here is confined to the interval
0 < x < 2t because either f(x) or f(2t - x) is zero elsewhere.
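
The closed form for the exponential case can be checked against the general convolution formula numerically. The sketch below evaluates $2\int f(2t - x)f(x)\,dx$ by a simple midpoint rule; the helper name, the rate parameter, and the evaluation point are hypothetical choices.

```python
import math


def sample_mean_pdf_n2(t, f, upper, steps=20_000):
    """g(t) = 2 * integral of f(2t - x) f(x) dx, by the midpoint rule on (0, upper)."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += f(2.0 * t - x) * f(x)
    return 2.0 * total * h


lam = 1.5  # hypothetical rate parameter


def f(x):
    """Exponential(lam) pdf."""
    return lam * math.exp(-lam * x) if x > 0 else 0.0


t = 0.8
# The integrand vanishes outside 0 < x < 2t, so integrate over that interval only.
numeric = sample_mean_pdf_n2(t, f, upper=2.0 * t)
closed_form = 4.0 * lam ** 2 * t * math.exp(-2.0 * lam * t)
print(numeric, closed_form)
```

The two numbers agree up to rounding, because the integrand is constant on the interval of integration.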


The same logic applies to other sample statistics, and extends to random 
sampling with n > 2. In many cases, we will report rather than derive 
the sampling distributions. 


8.3. The Sample Mean 


We report the sampling distributions of the sample mean in random
sampling, sample size n, for three populations:


(1) If X ~ Bernoulli(p), then Y ~ binomial(n, p), where $Y = n\bar{X}$.

(2) If $X \sim N(\mu, \sigma^2)$, then $\bar{X} \sim N(\mu, \sigma^2/n)$.

(3) If X ~ exponential($\lambda$), then W ~ chi-square(k), where k = 2n and
$W = k\lambda\bar{X}$.


The chi-square distribution (with parameter k) will be discussed in
Section 8.5. Its cdf is tabulated in Table A.2.

How does one use such information to calculate the probabilities of
events defined in terms of $\bar{X}$, that is, to get points on the cdf
$F_n(c) = \Pr(\bar{X} \le c)$? The approach is familiar: translate the event into one whose
probability is directly tabulated. We illustrate the procedure with our
three examples.

(1) Bernoulli(p). For given n and p:

$$\bar{X} \le c \;\Leftrightarrow\; Y = n\bar{X} \le nc = c^*,$$

say. Go to a table, found in some statistics texts, which gives the
binomial(n, p) cdf G(·; n, p), say, and read off $F_n(c) = G(c^*; n, p)$. Alternatively,
use the binomial(n, p) pmf formula and sum up the appropriate terms.

(2) Normal($\mu$, $\sigma^2$). For given $\mu$ and $\sigma^2$:

$$\bar{X} \le c \;\Leftrightarrow\; Z = (\bar{X} - \mu)/(\sigma/\sqrt{n}) \le (c - \mu)/(\sigma/\sqrt{n}) = c^*,$$

say. Go to Table A.1, which gives the standard normal cdf $\Phi(\cdot)$, say, and
read off $F_n(c) = \Phi(c^*)$.

(3) Exponential($\lambda$). For given n and $\lambda$, calculate k = 2n. Now

$$\bar{X} \le c \;\Leftrightarrow\; W = k\lambda\bar{X} \le k\lambda c = c^*,$$

say. Go to Table A.2, which gives the chi-square(k) cdf $G_k(\cdot)$, say, and
read off $F_n(c) = G_k(c^*)$.

We see that the sampling distribution of the sample mean differs as
the population differs, and as the sample size differs. What can be said
in general (that is, without reference to the specific form of the parent
population) about the expectation and the variance of the sample
mean? Now

$$\bar{X} = (1/n)X_1 + \cdots + (1/n)X_n$$

is a linear function of the n random variables $X_i$. Extending what we
know about expectations and variances of linear functions of two
variables (T5, Section 5.1) to the n > 2 case, we calculate

$$E(\bar{X}) = (1/n)E(X_1) + \cdots + (1/n)E(X_n) = (1/n)\mu + \cdots + (1/n)\mu = \mu,$$
$$V(\bar{X}) = (1/n^2)V(X_1) + \cdots + (1/n^2)V(X_n) + (2/n^2)C(X_1, X_2) + \cdots$$
$$= (1/n^2)\sigma^2 + \cdots + (1/n^2)\sigma^2 = (n/n^2)\sigma^2 = \sigma^2/n,$$

where every covariance term vanishes because the $X_i$ are independent.
This establishes our key result:

SAMPLE MEAN THEOREM. In random sampling, sample size n,
from any population with $E(X) = \mu$ and $V(X) = \sigma^2$, the sample mean
$\bar{X}$ has $E(\bar{X}) = \mu$ and $V(\bar{X}) = \sigma^2/n$.


8.4. Sample Moments


The Sample Mean Theorem must cover other sample statistics as well,
in random sampling.
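
Before extending it, the Sample Mean Theorem itself can be verified exactly for a small discrete population by enumerating every possible sample. The sketch below assumes a hypothetical three-point population and uses exact rational arithmetic; the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population pmf, chosen only for illustration.
population = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}


def population_moments(pmf):
    """Return (mu, sigma^2) of the population."""
    mu = sum(p * x for x, p in pmf.items())
    sigma2 = sum(p * (x - mu) ** 2 for x, p in pmf.items())
    return mu, sigma2


def sample_mean_moments(pmf, n):
    """Exact E and V of the sample mean, enumerating all samples of size n."""
    mean = Fraction(0)
    second = Fraction(0)
    for sample in product(pmf, repeat=n):
        prob = Fraction(1)
        for x in sample:
            prob *= pmf[x]  # independence: joint pmf is the product
        xbar = Fraction(sum(sample), n)
        mean += prob * xbar
        second += prob * xbar ** 2
    return mean, second - mean ** 2


mu, sigma2 = population_moments(population)
print(mu, sigma2)                          # 3/4 11/16
print(sample_mean_moments(population, 3))  # expectation mu, variance sigma2/3
```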


Sample Raw Moments 


Recall from Section 3.2 the definition of the population rth raw moment:
$E(X^r) = \mu_r'$, say. The corresponding sample raw moment is

$$M_r' = (1/n)\sum_i X_i^r.$$

Let $Y = X^r$, so $Y_i = X_i^r$. Then $M_r' = (1/n)\sum_i Y_i = \bar{Y}$ is a sample mean, and
$\mathbf{Y} = (Y_1, \ldots, Y_n)'$ is a random sample on the variable Y. So the theorem
must apply to $M_r' = \bar{Y}$. Now

$$E(Y) = E(X^r) = \mu_r',$$
$$V(Y) = E(Y^2) - E^2(Y) = E(X^{2r}) - [E(X^r)]^2 = \mu_{2r}' - (\mu_r')^2.$$

So

$$E(M_r') = \mu_r', \quad V(M_r') = [\mu_{2r}' - (\mu_r')^2]/n.$$


Sample Moments about the Population Mean


Recall also the definition of the population rth central moment:
$E(X - \mu)^r = \mu_r$, say. The corresponding sample statistic is

$$M_r^* = (1/n)\sum_i (X_i - \mu)^r.$$

Let $Y = (X - \mu)^r$, so $Y_i = (X_i - \mu)^r$. Then $M_r^* = (1/n)\sum_i Y_i = \bar{Y}$ is a sample
mean in random sampling on the variable Y, so the theorem must apply.
Thus

$$E(M_r^*) = \mu_r, \quad V(M_r^*) = (\mu_{2r} - \mu_r^2)/n.$$

For example, for r = 2, we have $M_2^* = (1/n)\sum_i (X_i - \mu)^2$, with

$$E(M_2^*) = \mu_2, \quad V(M_2^*) = (\mu_4 - \mu_2^2)/n.$$

We refer to $M_2^*$ as the ideal sample variance: in practice $M_2^*$ cannot be
computed, because $\mu$ is unknown.


Sample Moments about the Sample Mean


Consider the rth sample moment about the sample mean:

$$M_r = (1/n)\sum_i (X_i - \bar{X})^r.$$

Let $Y = (X - \bar{X})^r$, so $Y_i = (X_i - \bar{X})^r$, and $M_r = \bar{Y}$. However, the $Y_i$'s are
not independent random variables. To see this, let

$$U_1 = X_1 - \bar{X}, \quad U_2 = X_2 - \bar{X},$$

and calculate

$$C(U_1, U_2) = C(X_1, X_2) + C(\bar{X}, \bar{X}) - C(X_1, \bar{X}) - C(X_2, \bar{X})$$
$$= 0 + V(X)/n - V(X)/n - V(X)/n = -V(X)/n.$$

The $U_i$'s are correlated and hence cannot be independent. So the
$Y_i = U_i^r$ are presumably not independent either. If so, the Sample Mean
Theorem does not apply to sample moments about the sample mean.
(To reinforce the presumption, consider the case n = 2: there
$U_1 = -U_2$, so $Y_1 = \pm Y_2$.)

The expectation and variance can still be obtained by brute force. We
confine attention to the sample variance

$$S^2 = M_2 = (1/n)\sum_i (X_i - \bar{X})^2.$$

As a matter of algebra,

$$\sum_i (X_i - \bar{X})^2 = \sum_i [(X_i - \mu) - (\bar{X} - \mu)]^2$$
$$= \sum_i (X_i - \mu)^2 + n(\bar{X} - \mu)^2 - 2(\bar{X} - \mu)\sum_i (X_i - \mu)$$
$$= \sum_i (X_i - \mu)^2 - n(\bar{X} - \mu)^2.$$

So $M_2 = M_2^* - (\bar{X} - \mu)^2$, whence

$$E(M_2) = E(M_2^*) - E(\bar{X} - \mu)^2 = \mu_2 - V(\bar{X}) = \mu_2 - \mu_2/n = \sigma^2(1 - 1/n).$$

In similar fashion it can be shown that

$$V(M_2) = (n - 1)^2\{\mu_4 - [(n - 3)/(n - 1)]\mu_2^2\}/n^3$$
$$= (\mu_4 - \mu_2^2)/n - 2(\mu_4 - 2\mu_2^2)/n^2 + (\mu_4 - 3\mu_2^2)/n^3.$$

A couple of remarks on the relation between the sample variance and
the ideal sample variance are useful here. First, because
$M_2 = M_2^* - (\bar{X} - \mu)^2$, it follows that $M_2 \le M_2^*$ in every sample. Second, if n is large,
then

$$E(M_2) \cong \mu_2 = \sigma^2 = E(M_2^*), \quad V(M_2) \cong (\mu_4 - \mu_2^2)/n = V(M_2^*),$$

suggesting that when the sample size is large, the distributions of the
statistics $M_2$ and $M_2^*$ will be quite similar. We will formalize this
suggestion in Section 9.6.
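
The brute-force results for the sample variance can be confirmed exactly by enumeration. The sketch below assumes a hypothetical three-point population and a sample of size n = 4, and compares the enumerated expectation and variance of $M_2$ with the formulas $\sigma^2(1 - 1/n)$ and $(n - 1)^2\{\mu_4 - [(n - 3)/(n - 1)]\mu_2^2\}/n^3$. Function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population pmf, chosen only for illustration.
population = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 4)}


def central_moment(pmf, r):
    """Population rth central moment, E(X - mu)^r."""
    mu = sum(p * x for x, p in pmf.items())
    return sum(p * (x - mu) ** r for x, p in pmf.items())


def sample_variance_moments(pmf, n):
    """Exact E(M2) and V(M2), with M2 = (1/n) * sum of (X_i - Xbar)^2."""
    first = Fraction(0)
    second = Fraction(0)
    for sample in product(pmf, repeat=n):
        prob = Fraction(1)
        for x in sample:
            prob *= pmf[x]
        xbar = Fraction(sum(sample), n)
        m2 = sum((x - xbar) ** 2 for x in sample) / n
        first += prob * m2
        second += prob * m2 ** 2
    return first, second - first ** 2


n = 4
mu2, mu4 = central_moment(population, 2), central_moment(population, 4)
e_m2, v_m2 = sample_variance_moments(population, n)
e_formula = mu2 * (1 - Fraction(1, n))
v_formula = Fraction((n - 1) ** 2, n ** 3) * (mu4 - Fraction(n - 3, n - 1) * mu2 ** 2)
print(e_m2, e_formula)
print(v_m2, v_formula)
```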


8.5. Chi-square and Student's t Distributions 


As a preliminary to further discussion of sampling from a univariate 
normal population, we introduce two other univariate distributions. 

(1) Chi-square Distribution. If $Z_1, \ldots, Z_k$ are independent N(0, 1)
variables, and $W = \sum_{i=1}^k Z_i^2$, then the pdf of W is

$$g_k(w) = (1/2)(w/2)^{k/2 - 1}\exp(-w/2)/\Gamma(k/2) \quad \text{for } w > 0,$$

with $g_k(w) = 0$ elsewhere. Here $\Gamma(n)$ is the gamma function:

$$\Gamma(1/2) = \sqrt{\pi}, \quad \Gamma(1) = 1, \quad \Gamma(n) = (n - 1)\Gamma(n - 1).$$

We write this situation as $W \sim \chi^2(k)$. The pdf defines the chi-square
distribution, a one-parameter family. Figures 8.1 and 8.2 plot the pdf
for selected values of k, while the cdf is tabulated in Table A.2.

The derivation reverses: if $W \sim \chi^2(k)$, then W can be expressed as
the sum of squares of k independent N(0, 1) variables. Recall from
Section 7.1 that

$$Z \sim N(0, 1) \;\Rightarrow\; E(Z) = 0, \quad E(Z^2) = 1, \quad E(Z^4) = 3.$$

So $V(Z^2) = 3 - 1^2 = 2$. Because the $Z_i$'s are independent N(0, 1)
variables, we have

$$E(W) = \sum_{i=1}^k E(Z_i^2) = k, \quad V(W) = \sum_{i=1}^k V(Z_i^2) = 2k.$$

The parameter k, traditionally called "the degrees of freedom," is simply
the expectation of the variable W.
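
The moments E(W) = k and V(W) = 2k can also be recovered from the pdf by numerical integration, as a check on the formula. The sketch below uses a midpoint rule; the function names, the upper integration limit, and the step count are arbitrary choices.

```python
import math


def chi_square_pdf(w, k):
    """Chi-square(k) pdf; zero for w <= 0."""
    if w <= 0:
        return 0.0
    return 0.5 * (w / 2.0) ** (k / 2.0 - 1.0) * math.exp(-w / 2.0) / math.gamma(k / 2.0)


def chi_square_moments(k, upper=200.0, steps=200_000):
    """Mean and variance of chi-square(k) by midpoint-rule integration on (0, upper)."""
    h = upper / steps
    m1 = m2 = 0.0
    for i in range(steps):
        w = (i + 0.5) * h
        dens = chi_square_pdf(w, k)
        m1 += w * dens * h
        m2 += w * w * dens * h
    return m1, m2 - m1 ** 2


for k in (4, 5):
    mean, var = chi_square_moments(k)
    print(k, round(mean, 3), round(var, 3))  # close to k and 2k
```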




Figure 8.1 Chi-square pdf's: k = 1, 2.


For future reference, we also report that

$$E(1/W) = 1/(k - 2) \quad \text{for } k > 2,$$
$$E(1/W^2) = 1/[(k - 2)(k - 4)] \quad \text{for } k > 4.$$

(2) Student's t-distribution. If Z ~ N(0, 1), $W \sim \chi^2(k)$, with Z and W
being independent, and $U = Z/\sqrt{W/k}$, then the pdf of U is

$$g_k(u) = \frac{\Gamma[(k + 1)/2]}{\sqrt{k\pi}\,\Gamma(k/2)}\,(1 + u^2/k)^{-(k + 1)/2}.$$

We write this situation as U ~ t(k). This pdf defines the Student's
t-distribution, a one-parameter family. The parameter k is again called
"the degrees of freedom." The pdf is symmetric, centered at zero, and
similar in shape to a standard normal pdf. The cdf is tabulated in many
texts.

Figure 8.2 Chi-square pdf's: k = 3, 5, 7, 9.

The derivation reverses: if U ~ t(k), then U can be expressed as
$Z/\sqrt{W/k} = \sqrt{k}Z/\sqrt{W}$, where Z ~ N(0, 1) is independent of $W \sim \chi^2(k)$.
So some of the moments can be calculated conveniently from those of
the standard normal and chi-square distributions. In particular:

$$E(U) = \sqrt{k}\,E(Z)E(1/\sqrt{W}) = \sqrt{k}\cdot 0\cdot E(1/\sqrt{W}) = 0 \quad \text{for } k > 1,$$
$$V(U) = E(U^2) = E(kZ^2/W) = kE(Z^2)E(1/W) = k/(k - 2) \quad \text{for } k > 2.$$

Observe that E(U) = 0, and that for large k, $V(U) \cong 1$. Indeed, for
$k \ge 30$, the t(k) distribution is practically indistinguishable from a N(0, 1)
distribution. More formally, as $k \to \infty$, the cdf of the Student's
t-distribution, namely $G_k(\cdot)$, converges to the standard normal cdf F(·). More
on this in Section 9.6.
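
How close the t(k) pdf is to the N(0, 1) pdf can be gauged numerically. The sketch below evaluates both densities on a grid from -4 to 4 and reports the largest gap for several values of k; the function names and the grid are hypothetical choices, and log-gamma is used only to avoid overflow for large k.

```python
import math


def student_t_pdf(u, k):
    """Student's t(k) pdf, computed on the log scale for numerical stability."""
    log_const = (math.lgamma((k + 1) / 2.0) - math.lgamma(k / 2.0)
                 - 0.5 * math.log(k * math.pi))
    return math.exp(log_const - (k + 1) / 2.0 * math.log1p(u * u / k))


def std_normal_pdf(u):
    """Standard normal pdf f(u)."""
    return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)


def t_variance(k):
    """V(U) = k/(k - 2), valid for k > 2."""
    return k / (k - 2.0)


def max_gap(k):
    """Largest absolute difference between the two pdfs on a grid over [-4, 4]."""
    return max(abs(student_t_pdf(u / 10.0, k) - std_normal_pdf(u / 10.0))
               for u in range(-40, 41))


for k in (3, 10, 30, 1000):
    print(k, round(t_variance(k), 4), round(max_gap(k), 4))
```

The gap shrinks as k grows, in line with the convergence of the t(k) cdf to the standard normal cdf.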


8.6. Sampling from a Normal Population 


Now we can fully characterize the joint sampling distribution of the 
sample mean and sample variance, in random sampling from a normal 
population. We have $X \sim N(\mu, \sigma^2)$, and $X_1, \ldots, X_n$ as the random
sample.


Standard Normal Population 


First suppose that X ~ N(0, 1). For convenience, use Y rather than X
as the name of the variable. So the random sample is $Y_1, \ldots, Y_n$,
the sample mean is $\bar{Y} = \sum_i Y_i/n$, and the sample variance is
$V^* = \sum_i (Y_i - \bar{Y})^2/n$.

The joint distribution of $\bar{Y}$ and $V^*$ has these features:

F1*.  $\sqrt{n}\,\bar{Y} \sim N(0, 1)$,
F2*.  $nV^* \sim \chi^2(n - 1)$,
F3*.  $\bar{Y}$ and $V^*$ are independent,
F4*.  $\sqrt{n - 1}\,\bar{Y}/\sqrt{V^*} \sim t(n - 1)$.


Proof. Evidently all four items will be established iff we can write

(8.1)   $\sqrt{n}\,\bar{Y} = Z_1, \quad nV^* = Z_2^2 + \cdots + Z_n^2,$

where $Z_1, Z_2, \ldots, Z_n$ are independent N(0, 1) variables. Then F1*-F3*
follow immediately, while F4* follows by

$$\sqrt{n - 1}\,\bar{Y}/\sqrt{V^*} = \sqrt{n}\,\bar{Y}/\sqrt{nV^*/(n - 1)} = Z_1/\sqrt{nV^*/(n - 1)}.$$

A more general version of Eq. (8.1) will be established later: see Exercise
21.1. For now, we cover only the case n = 2. Let

$$Z_1 = (Y_1 + Y_2)/\sqrt{2}, \quad Z_2 = (Y_1 - Y_2)/\sqrt{2}.$$

Since $Y_1$, $Y_2$ are BVN, it follows that $Z_1$, $Z_2$ are BVN. Calculating means,
variances, and covariances then shows that $Z_1$ and $Z_2$ are independent
N(0, 1) variables. Now $Z_1 = \sqrt{2}\,\bar{Y}$, and

$$Y_1 - \bar{Y} = Y_1 - (Y_1 + Y_2)/2 = (Y_1 - Y_2)/2 = Z_2/\sqrt{2},$$
$$Y_2 - \bar{Y} = -Z_2/\sqrt{2},$$

so $2V^* = Z_2^2/2 + Z_2^2/2 = Z_2^2$. ■


General Normal Population


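Now suppose $X \sim N(\mu, \sigma^2)$. The sample mean is $\bar{X} = \sum_i X_i/n$, and the
sample variance is $S^2 = \sum_i (X_i - \bar{X})^2/n$. Let $Y = (X - \mu)/\sigma$, so $X = \mu + \sigma Y$,
with $Y \sim N(0, 1)$. Then


$$\bar{X} = \mu + \sigma\bar{Y} \;\Rightarrow\; \bar{X} - \mu = \sigma\bar{Y} \;\Rightarrow\; \sqrt{n}(\bar{X} - \mu)/\sigma = \sqrt{n}\,\bar{Y},$$
$$X_i - \bar{X} = \sigma(Y_i - \bar{Y}) \;\Rightarrow\; \sum_i (X_i - \bar{X})^2 = \sigma^2 \sum_i (Y_i - \bar{Y})^2,$$
$$nS^2 = \sigma^2 nV^* \;\Rightarrow\; W = nS^2/\sigma^2 = nV^*,$$
$$S = \sigma\sqrt{V^*} \;\Rightarrow\; U = \sqrt{n - 1}\,(\bar{X} - \mu)/S = \sqrt{n - 1}\,\bar{Y}/\sqrt{V^*}.$$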


Because F1*–F4* hold for the special $N(0, 1)$ population, we conclude
that:


F1. $\bar{X} \sim N(\mu, \sigma^2/n)$,
F2. $W = nS^2/\sigma^2 \sim \chi^2(n - 1)$,
F3. $\bar{X}$ and $S^2$ are independent,
F4. $U = \sqrt{n - 1}\,(\bar{X} - \mu)/S \sim t(n - 1)$.


These features completely specify the joint distribution of $\bar{X}$ and $S^2$ in
random sampling from a normal population.
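
A simulation can make F1 through F3 concrete. The following is a minimal sketch under assumed values ($n = 5$, $\mu = 10$, $\sigma = 3$, 40,000 replications, fixed seed); the helper names are hypothetical. It checks that $W = nS^2/\sigma^2$ averages about $n - 1$, that $V(\bar{X})$ is about $\sigma^2/n$, and that $\bar{X}$ and $W$ are essentially uncorrelated, which is what independence implies.

```python
import random

def mean(values):
    return sum(values) / len(values)

def normal_sampling_check(n=5, mu=10.0, sigma=3.0, reps=40_000, seed=1):
    """Draw many normal samples; summarize the sample mean and W = n*S^2/sigma^2."""
    rng = random.Random(seed)
    xbars, ws = [], []
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = mean(xs)
        s2 = mean([(x - xbar) ** 2 for x in xs])  # divisor n, as in the text
        xbars.append(xbar)
        ws.append(n * s2 / sigma ** 2)
    mx, mw = mean(xbars), mean(ws)
    var_x = mean([(x - mx) ** 2 for x in xbars])
    var_w = mean([(w - mw) ** 2 for w in ws])
    cov = mean([(x - mx) * (w - mw) for x, w in zip(xbars, ws)])
    return {"mean_W": mw, "var_xbar": var_x, "corr": cov / (var_x * var_w) ** 0.5}

result = normal_sampling_check()
print(result)
```

Zero correlation is only a necessary condition for independence, so this is a plausibility check on F3 rather than a demonstration of it.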

Here are a few further remarks. 

* Observe the contrast between these two results:


$$Z = \sqrt{n}\,(\bar{X} - \mu)/\sigma \sim N(0, 1),$$
$$U = \sqrt{n - 1}\,(\bar{X} - \mu)/S \sim t(n - 1).$$


It is sometimes said that the first gives the distribution of $\bar{X}$ when $\sigma^2$ is
known, while the second gives the distribution of $\bar{X}$ when $\sigma^2$ is unknown.
But this, of course, cannot be correct. Actually the first gives the distribution
of a certain linear function of $\bar{X}$ (whether or not $\sigma^2$ is known),
while the second gives the distribution of a certain function of $\bar{X}$ and
$S^2$ (whether or not $\sigma^2$ is known). The practical distinction is rather that
the first is usable for inference about $\mu$ when $\sigma^2$ is known, while the
second, as will be seen in Section 11.5, is usable for that purpose even
when $\sigma^2$ is unknown.

* Recalling that for large $k$, the $t(k)$ distribution is virtually indistinguishable
from the $N(0, 1)$ distribution, we may, when $n$ is large, treat
$U$ as if it were $Z$. Indeed, we may use the "limiting distribution" for $U$,
namely the standard normal, as an approximation for the distribution
of $U$, even for moderate $n$. More on this in Chapter 9.

* Observe the contrast between the sample variance $M_2 = S^2$ and
the ideal sample variance $M_2^*$. Because $M_2^* = (1/n)\sum_i (X_i - \mu)^2$ is the
mean of the squares of $n$ independent $N(0, \sigma^2)$ variables, we know that
$nM_2^*/\sigma^2 \sim \chi^2(n)$, while we have seen that $nM_2/\sigma^2 \sim \chi^2(n - 1)$. It is
sometimes said that "one degree of freedom is lost" when $\bar{X}$ is used in
place of $\mu$, a remark that sounds like punishment for a crime. A less
dramatic statement is that the expectation is reduced by one.


Exercises 


8.1 Suppose that $X_1, X_2$ are independent drawings from a population
in which the pdf of $X$ is $f(x) = 1$ for $0 \le x \le 1$, with $f(x) = 0$ elsewhere.
Let $T = (X_1 + X_2)/2$. Find the pdf of $T$.


8.2 Suppose that $Y_1, Y_2, Y_3$ are independent drawings from a population
in which the pdf of $Y$ is


$$f(y) = (2 + y)/16 \quad \text{for } 0 \le y \le 4,$$


with $f(y) = 0$ elsewhere. Let $Z = (Y_1 + Y_2 + Y_3)/3$. Calculate $E(Z)$ and
$V(Z)$.


8.3 Consider these alternative populations for a random variable X: 


(a) Bernoulli with parameter $p = 0.5$.
(b) Normal with parameters $\mu = 0.5$, $\sigma^2 = 0.25$.
(c) Exponential with parameter $\lambda = 2$.


Let $A$ be the event $\{0.4 < X \le 0.6\}$. For each population, find $E(X)$,
$V(X)$, and $\Pr(A)$.


8.4 Now consider random sampling, sample size 10, from the populations
in Exercise 8.3. Let $\bar{X}$ denote the sample mean, and let $B$ be the
event $\{0.4 < \bar{X} < 0.6\}$. For each population, find $E(\bar{X})$, $V(\bar{X})$, and $\Pr(B)$.
Comparing these results with those in Exercise 8.3, comment on the
effect of increasing sample size on the distribution of sample means.


8.5 Let $\bar{X}$ and $S^2$ denote the sample mean and sample variance in
random sampling, sample size 17, from a $N(10, 102)$ population. Find
the probability of each of these events:


$A = \{\bar{X} < 14.9\}$, $B = \{5.1 < \bar{X} \le 14.9\}$, $C = \{S^2 \le 92.04\}$,
$D = B \cap C$, $E = \{4(\bar{X} - 10)/S \le 1.746\}$, $F = \{\bar{X} \le 10 + 0.53S\}$.


(Note: For E and F, a Student's t-table will be required.)


8.6 The random variable $X$ has the exponential distribution with
parameter $\lambda = 2$.


(a) Use the definite integral formula

$$\int_0^\infty t^n e^{-at}\,dt = n!/a^{n+1} \quad \text{for } a > 0 \text{ and } n \text{ a positive integer},$$

to show that

$$E(X) = 1/2, \quad E(X^2) = 1/2, \quad E(X^3) = 3/4, \quad E(X^4) = 3/2.$$


(b) Consider random sampling, sample size 20, from this population,
and let $T = (X_1^2 + \cdots + X_{20}^2)/20$. Calculate $E(T)$ and $V(T)$.


8.7 Let $\bar{X}$ denote the sample mean in random sampling, sample size
$n$, from a population in which $X \sim$ exponential($\lambda$). So $E(\bar{X}) = 1/\lambda$.
Consider the sample statistic $T = 1/\bar{X}$.


(a) Is $E(T)$ greater than, equal to, or less than $\lambda$? Justify your answer
by reference to Jensen's Inequality (Section 3.5).
(b) Suppose $\lambda = 2$ and $n = 10$. Calculate $E(T)$ and $V(T)$.


8.8 Recall that if $X \sim$ exponential($\lambda$), then in random sampling,
sample size $n$, one has $W = 2n\lambda\bar{X} \sim \chi^2(2n)$. So it must be the case that
$W = \sum_i V_i$, where the $V_i$'s are independent $\chi^2(2)$ variables. To reconcile
these two results, show by reference to their pdf's that the chi-square(2)
distribution is the same as the exponential(1/2) distribution.


9.1. Introduction


As we have seen in Chapter 8, the probability distribution of the sample
mean in random sampling depends on the parent population and on
the sample size. For three specific parent populations, we have reported
the distribution of the sample mean. For any parent population, we
have shown that $E(\bar{X}) = \mu$ and $V(\bar{X}) = \sigma^2/n$. Now we develop additional
information about the distribution of $\bar{X}$ that is valid for all parent
populations. The information concerns asymptotic properties of the
distribution of the sample mean.

As $n$ gets large, $E(\bar{X})$ stays at $\mu$ while $V(\bar{X}) = \sigma^2/n$ goes to zero. So it
is plausible that the distribution of the sample mean becomes degenerate
at the point $\mu$ as $n$ goes to infinity. On the other hand, consider the
standardized sample mean


$$Z = [\bar{X} - E(\bar{X})]/\sqrt{V(\bar{X})} = (\bar{X} - \mu)/(\sigma/\sqrt{n}) = \sqrt{n}\,(\bar{X} - \mu)/\sigma.$$


By linear function rules, we see that $E(Z) = 0$ and $V(Z) = 1$ for every
$n$. So it is plausible that the distribution of the standardized sample
mean approaches a nondegenerate limit as $n$ goes to infinity. If so, we
might want to use that limiting distribution to approximate the distribution
of $Z$ even when the sample size is modest. If we do that, we will
be approximating the distribution of $\bar{X}$ itself, because (for given $n$, $\mu$,
$\sigma$) an event that is defined in terms of $\bar{X}$ can be translated into an event
that is defined in terms of $Z$.

These remarks will be formalized as follows. In random sampling
from any population with $E(X) = \mu$ and $V(X) = \sigma^2$:

* Law of Large Numbers. The probability limit of $\bar{X}$ is $\mu$.
* Central Limit Theorem. The limiting distribution of $Z$ is $N(0, 1)$.
* The asymptotic distribution of $\bar{X}$ is $N(\mu, \sigma^2/n)$.


To clarify the situation, consider a set of charts that refer to random
sampling from the exponential(1) population, a situation where we
know the exact sampling distribution of $\bar{X}$ (and hence of $Z$). Figures 9.1
and 9.2 show the pdf's and cdf's of $\bar{X}$ for $n = 5$, $n = 30$, and $n = 90$.
Observe how the distribution becomes increasingly concentrated at the
point $\mu = 1$ as $n$ increases. In contrast, Figures 9.3 and 9.4 show the
pdf's and cdf's of $Z$ for $n = 5$, $n = 30$, and $n = 90$. Observe how the
distribution becomes stabilized as $n$ increases, taking on the appearance
of the $N(0, 1)$ distribution. As we shall see, those asymptotic properties
prevail regardless of the population.

Figure 9.1 Sample mean pdf's: exponential(1) population.

Figure 9.2 Sample mean cdf's: exponential(1) population.

Figure 9.3 Standardized sample mean pdf's: exponential(1) population.

Figure 9.4 Standardized sample mean cdf's: exponential(1) population.


9.2. Sequences of Sample Statistics 


To proceed systematically, think of a sequence of sample statistics,
indexed by sample size. For example: $\bar{X}_1$ = sample mean in random
sampling, sample size 1; $\bar{X}_2$ = sample mean in random sampling, sample
size 2; ...; $\bar{X}_n$ = sample mean in random sampling, sample size $n$. Each
of these random variables has its own pdf (or pmf), cdf, expectation,
variance, and so forth.

More generally, let $T_n$ be a sequence of random variables, with cdf's
$G_n(t) = \Pr(T_n \le t)$, expectations $E(T_n)$, and variances $V(T_n)$. In what
follows, $T_n$ may refer to the $n$th variable in the sequence, or to the
sequence as a whole. We will use "lim" throughout as shorthand for
"limit as $n \to \infty$."

We define three types of convergence. 

Convergence in Probability. If there is a constant $c$ such that $\lim G_n(t) = 0$
for all $t < c$ and $\lim G_n(t) = 1$ for all $t > c$, then we say that (the
sequence) $T_n$ converges in probability to $c$, or equivalently that the probability
limit of $T_n$ is $c$. We write this as $T_n \xrightarrow{P} c$, or as plim $T_n = c$. Let
$A_n = \{|T_n - c| \ge \epsilon\}$ where $\epsilon > 0$, so


$$\Pr(A_n) = 1 - G_n(c + \epsilon) + \Pr(T_n = c + \epsilon) + G_n(c - \epsilon).$$


If $T_n$ converges in probability to $c$, then $\lim \Pr(A_n) = 1 - 1 + 0 + 0 = 0$.
So an equivalent way to define convergence in probability of $T_n$ to $c$
is that


$$\lim \Pr(|T_n - c| \ge \epsilon) = 0 \quad \text{for all } \epsilon > 0.$$


Convergence in Mean Square. If there is some constant $c$ such that
$\lim E(T_n - c)^2 = 0$, then we say that (the sequence) $T_n$ converges in mean
square to $c$. Two consequences are immediate:


C1. If $T_n$ is a sequence of random variables with $\lim E(T_n) = c$ and
$\lim V(T_n) = 0$, then $T_n$ converges in mean square to $c$.


Proof. $E(T_n - c)^2 = V(T_n) + [E(T_n) - c]^2$. Take limits. ∎


C2. If $T_n$ converges in mean square to $c$, then $T_n$ converges in probability
to $c$.


Proof. Let $A_n = \{|T_n - c| \ge \epsilon\}$ where $\epsilon > 0$. Applying Chebyshev
Inequality #1 (Section 3.5) gives $0 \le \Pr(A_n) \le E(T_n - c)^2/\epsilon^2$. Taking
limits gives $0 \le \lim \Pr(A_n) \le 0$, whence $\lim \Pr(A_n) = 0$. ∎


Convergence in Distribution. If there is some fixed cdf $G(t)$ such that
$\lim G_n(t) = G(t)$ for all $t$ at which $G(\cdot)$ is continuous, then we say that
(the sequence) $T_n$ converges in distribution to $G(\cdot)$, or equivalently that the
limiting distribution of $T_n$ is $G(\cdot)$. We write this as $T_n \xrightarrow{D} G(\cdot)$. Evidently
convergence in probability is the special case of convergence in distribution
in which the limiting distribution is degenerate.


9.3. Asymptotics of the Sample Mean 


We apply these concepts to the sequence of sample means in random
sampling from any population.

LAW OF LARGE NUMBERS, or LLN. In random sampling from
any population with $E(X) = \mu$ and $V(X) = \sigma^2$, the sample mean converges
in probability to the population mean.


Proof. We have $E(\bar{X}_n) = \mu$ and $V(\bar{X}_n) = \sigma^2/n$, so $\lim E(\bar{X}_n) = \mu$ and
$\lim V(\bar{X}_n) = 0$. So $\bar{X}_n$ converges in mean square to $\mu$ by C1, hence $\bar{X}_n$
converges in probability to $\mu$ by C2. ∎
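
The proof rests on the Chebyshev bound $\Pr(|\bar{X}_n - \mu| \ge \epsilon) \le \sigma^2/(n\epsilon^2)$. As an illustration only, the sketch below estimates that tail probability by simulation for the exponential(1) population ($\mu = \sigma^2 = 1$) at $n = 5, 30, 90$ and sets it beside the bound. The tolerance $\epsilon = 0.2$, the replication count, the seed, and the function names are assumptions chosen for the example.

```python
import random

def tail_probability(n, eps=0.2, reps=5000, seed=2):
    """Estimate Pr(|Xbar_n - 1| >= eps) for an exponential(1) population."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(xbar - 1.0) >= eps:
            hits += 1
    return hits / reps

def chebyshev_bound(n, eps=0.2, sigma2=1.0):
    """Upper bound sigma^2 / (n * eps^2), capped at 1 since it bounds a probability."""
    return min(1.0, sigma2 / (n * eps ** 2))

sizes = [5, 30, 90]
tails = [tail_probability(n) for n in sizes]
bounds = [chebyshev_bound(n) for n in sizes]
for n, t, b in zip(sizes, tails, bounds):
    print(n, t, b)
```

The simulated tail probabilities shrink as $n$ grows and stay below the bound, which is typically far from tight.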


CENTRAL LIMIT THEOREM, or CLT. In random sampling
from any population with $E(X) = \mu$ and $V(X) = \sigma^2$, the standardized
sample mean $Z = \sqrt{n}\,(\bar{X} - \mu)/\sigma$ converges in distribution to $N(0, 1)$.
Equivalently, $\sqrt{n}\,(\bar{X} - \mu)$ converges in distribution to $N(0, \sigma^2)$.


Proof. See DeGroot (1975, pp. 227-233). 


Associated with the CLT is an approximation procedure. The limiting
cdf of $Z_n = \sqrt{n}\,(\bar{X}_n - \mu)/\sigma$ will be used to approximate the exact cdf of
$Z_n$ for sample size $n$. If the cdf of $Z_n$ is $H_n(c^*) = \Pr(Z_n \le c^*)$, then we
will approximate $H_n(c^*)$ by $\Phi(c^*)$, where $\Phi(\cdot)$ is the $N(0, 1)$ cdf. This
procedure uses the limit of a sequence as an approximation to a term
in the sequence; the error in this approximation is arbitrarily small if $n$
is large enough. This is quite analogous to using $1/(1 - b)$ to approximate
the finite sum $1 + b + b^2 + \cdots + b^n$ in Keynesian multiplier
analysis (where $0 < b < 1$). Of course the approximation may be poor
when $n$ is small.

Approximating the cdf of the standardized sample mean $Z_n$ by the
$N(0, 1)$ cdf amounts to approximating the cdf of the sample mean $\bar{X}_n$
by the $N(\mu, \sigma^2/n)$ cdf. For,


$$F_n(c) = \Pr(\bar{X}_n \le c) = \Pr[\sqrt{n}\,(\bar{X}_n - \mu)/\sigma \le \sqrt{n}\,(c - \mu)/\sigma] = \Pr(Z_n \le c^*) = H_n(c^*),$$


with $c^* = \sqrt{n}\,(c - \mu)/\sigma$. When we use the approximation $F_n(c) \approx \Phi(c^*)$
we are in effect treating $\bar{X}_n$ as if it were distributed $N(\mu, \sigma^2/n)$. So we
say that the asymptotic distribution of $\bar{X}_n$ is $N(\mu, \sigma^2/n)$, and write


$$\bar{X}_n \overset{A}{\sim} N(\mu, \sigma^2/n).$$

More generally, whenever we have a sample statistic $T_n$ and parameters
$\theta$ and $\phi^2$ (that do not involve $n$), such that the standardized statistic
$\sqrt{n}\,(T_n - \theta)/\phi$ converges in distribution to the $N(0, 1)$ distribution, we
will say that $T_n$ is asymptotically distributed $N(\theta, \phi^2/n)$, and may refer
to $\theta$ and $\phi^2/n$ as the asymptotic expectation and asymptotic variance of $T_n$.

Observe again the distinction between the limiting distribution of the
sample mean, which is degenerate at $\mu$, and the asymptotic distribution of
the sample mean, which is $N(\mu, \sigma^2/n)$. Clearly the latter provides more
useful information.

It is tempting to be cynical about the relevance of asymptotic theory 
to empirical work, where the sample size may be modest. But in fact 
the approximations are typically quite accurate. 


Example. Consider random sampling, sample size 30, on the
random variable $X$, where $X \sim \chi^2(1)$. Find $\Pr(A)$ where $A = \{\bar{X} \le 1.16\}$.
Since $E(X) = 1$ and $V(X) = 2$, we have $\bar{X} \overset{A}{\sim} N(1, 2/30)$, whence $\Pr(A) \approx
\Phi(c^*)$, where $c^* = (1.16 - 1)/\sqrt{2/30} = 0.62$, and $\Phi(\cdot)$ is the $N(0, 1)$ cdf.
From the standard normal table, $\Phi(c^*) = 0.73$. For the exact calculation,
rely on the fact that $W = n\bar{X} = \sum_i X_i \sim \chi^2(30)$. From the $\chi^2(30)$ table,
$\Pr(A) = \Pr(W \le 34.8) = 0.75$. The approximation is very good even
though the sample size is modest.
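
The two numbers in this example can be reproduced without tables. The sketch below is one way to do it, not the text's own method: the normal cdf comes from the error function, and the chi-square cdf uses the closed form that holds for even degrees of freedom, $\Pr(\chi^2(2m) \le x) = 1 - e^{-x/2}\sum_{j=0}^{m-1}(x/2)^j/j!$. The function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def chi2_cdf_even(x, df):
    """Chi-square cdf for an even number of degrees of freedom (closed form)."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    m = df // 2
    half = x / 2.0
    term = 1.0   # the j = 0 term of the sum
    total = 1.0
    for j in range(1, m):
        term *= half / j   # builds (x/2)^j / j! recursively
        total += term
    return 1.0 - math.exp(-half) * total

n, c = 30, 1.16
c_star = (c - 1.0) / math.sqrt(2.0 / n)
approx = normal_cdf(c_star)      # CLT approximation to Pr(Xbar <= 1.16)
exact = chi2_cdf_even(n * c, n)  # exact value: Pr(W <= 34.8), W ~ chi-square(30)
print(round(c_star, 2), round(approx, 2), round(exact, 2))
```

Rounded to two places, these agree with the table values quoted above (0.62, 0.73, and 0.75).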


9.4. Asymptotics of Sample Moments 


The asymptotic results for the sample mean must apply directly to the
entire class of statistics that can be interpreted as sample means in
random sampling. As in Section 8.4, that class includes the sample raw
moments $M_r' = \bar{Y}$, where $Y = X^r$. In particular, for the sample second
raw moment, $M_2' = \sum_i X_i^2/n$, we have


$$M_2' \xrightarrow{P} \mu_2',$$
$$\sqrt{n}\,(M_2' - \mu_2')/\sqrt{\mu_4' - \mu_2'^2} \xrightarrow{D} N(0, 1),$$
$$M_2' \overset{A}{\sim} N[\mu_2', (\mu_4' - \mu_2'^2)/n],$$


where $\mu_r' = E(X^r)$ denotes the population $r$th raw moment.

The class also includes the sample moments about the population
mean, $M_r^* = \bar{Y}$, where $Y = (X - \mu)^r$. In particular, for the ideal sample
variance $M_2^* = \sum_i (X_i - \mu)^2/n$, we have


$$M_2^* \xrightarrow{P} \mu_2,$$
$$\sqrt{n}\,(M_2^* - \mu_2)/\sqrt{\mu_4 - \mu_2^2} \xrightarrow{D} N(0, 1),$$
$$M_2^* \overset{A}{\sim} N[\mu_2, (\mu_4 - \mu_2^2)/n],$$


where $\mu_r = E(X - \mu)^r$ denotes the population $r$th central moment.


9.5. Asymptotics of Functions of Sample Moments 


This would be the end of the story except that we are often concerned
with sample statistics that are not interpretable as sample means in
random sampling. Typically, the statistics are functions of such sample
means. So we require rules for getting probability limits and asymptotic
distributions for functions of sample means. For a linear function, there
is no problem: if $T_n = a + b\bar{X}_n$, where $a$ and $b$ are constants that do not
involve $n$, then $T_n = \bar{Y}_n$ is itself a sample mean in random sampling on
the variable $Y = a + bX$, whence

$$T_n \xrightarrow{P} \theta, \qquad \sqrt{n}\,(T_n - \theta)/\phi \xrightarrow{D} N(0, 1), \qquad T_n \overset{A}{\sim} N(\theta, \phi^2/n),$$

with $\theta = a + b\mu$ and $\phi^2 = b^2\sigma^2$.

But the sample statistics are not always that simple. For example, we
may be concerned with $T = 1/\bar{X}$, which is a nonlinear function of $\bar{X}$. Or
we may be concerned with the sample variance


$$S^2 = \sum_i (X_i - \bar{X})^2/n = M_2^* - (\bar{X} - \mu)^2,$$

which is a function of $M_2^*$ and $\bar{X}$ (nonlinear in the latter). Another
example is the sample t-ratio

$$U = \sqrt{n}\,(\bar{X} - \mu)/S,$$


which is again a function of $M_2^*$ and $\bar{X}$ (nonlinear in both arguments).
To derive the asymptotics of functions of sample moments, the key
tools are the Slutsky Theorems. Here $T_n$, $V_n$, and $W_n$ are sequences of
random variables, while the functions $h(\cdot)$ and the constants $c$ do not
involve $n$. The theorems are:


S1. If $T_n \xrightarrow{P} c$ and $h(T_n)$ is continuous at $c$, then $h(T_n) \xrightarrow{P} h(c)$.


S2. If $V_n \xrightarrow{P} c_1$ and $W_n \xrightarrow{P} c_2$, and $h(V_n, W_n)$ is continuous at $(c_1, c_2)$,
then $h(V_n, W_n) \xrightarrow{P} h(c_1, c_2)$.


S3. If $V_n \xrightarrow{P} c$ and $W_n$ has a limiting distribution, then the limiting
distribution of $(V_n + W_n)$ is the same as that of $c + W_n$.


S4. If $V_n \xrightarrow{P} c$ and $W_n$ has a limiting distribution, then the limiting
distribution of $(V_n W_n)$ is the same as that of $cW_n$.


Theorems S1 and S2 say that the probability limit of a continuous
function is the function of the probability limits. Further, S3 and S4
refer to the situation in which one variable has a probability limit and
another variable has a limiting distribution. In their sum or product,
the first variable can be treated as a constant as far as limiting distributions
are concerned. For example, if $V_n \xrightarrow{P} c$ and $W_n \xrightarrow{D} N(0, \sigma^2)$, then
$(V_n + W_n) \xrightarrow{D} N(c, \sigma^2)$ and $(V_n W_n) \xrightarrow{D} N(0, c^2\sigma^2)$.
The other key tool is
The other key tool is 


S5. Delta Method. If $\sqrt{n}\,(T_n - \theta) \xrightarrow{D} N(0, \phi^2)$ and $U_n = h(T_n)$ is
continuously differentiable at $\theta$, then


$$\sqrt{n}\,[U_n - h(\theta)] \xrightarrow{D} N\{0, [h'(\theta)]^2\phi^2\}.$$


Equivalently, if $T_n \overset{A}{\sim} N(\theta, \phi^2/n)$ and $U_n = h(T_n)$ is continuously
differentiable at $\theta$, then


$$U_n \overset{A}{\sim} N\{h(\theta), [h'(\theta)]^2\phi^2/n\}.$$


Here again the understanding is that the function $h(\cdot)$ does not involve
$n$. What S5 says is that the asymptotic distribution of $U_n = h(T_n)$ is the
same as that of its linear approximation at $\theta$, namely $U_n^* = h(\theta) +
h'(\theta)(T_n - \theta)$.
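
To see S5 at work on a case not treated in the text, take $h(t) = \log t$ and an exponential population with mean $\theta$, so that $\phi^2 = \theta^2$ and $[h'(\theta)]^2\phi^2 = (1/\theta^2)\theta^2 = 1$. The Delta method then says $n\,V(\log \bar{X}_n)$ should be close to 1 for large $n$. The sketch below checks this by simulation; the choice of $h$, the sample size, the replication count, the seed, and the function name are all assumptions made for illustration.

```python
import math
import random

def scaled_variance_of_log_mean(n=200, theta=1.0, reps=4000, seed=3):
    """Return n * Var(log Xbar_n) for an exponential population with mean theta."""
    rng = random.Random(seed)
    us = []
    for _ in range(reps):
        # expovariate takes the rate parameter, which is 1/theta here.
        xbar = sum(rng.expovariate(1.0 / theta) for _ in range(n)) / n
        us.append(math.log(xbar))
    mean_u = sum(us) / reps
    var_u = sum((u - mean_u) ** 2 for u in us) / reps
    return n * var_u

delta_prediction = 1.0  # [h'(theta)]^2 * phi^2 for h = log, whatever theta is
simulated = scaled_variance_of_log_mean()
print(simulated, delta_prediction)
```

That the prediction does not depend on $\theta$ reflects the fact that the logarithm is the variance-stabilizing transformation for this population.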

For proofs of S1–S5, see Rao (1978, pp. 122, 124, 385–386). Here is
an intuitive argument for S1: By continuity, $h(T_n)$ will be confined to a
neighborhood of $h(c)$ provided that $T_n$ is confined to a neighborhood
of $c$. The probability of the latter event can be made arbitrarily close to
1 by taking $n$ sufficiently large, so the same is true of the former event.
And here is some intuition for S5: By the mean value theorem of
calculus, $U_n - h(\theta) = h'(T_n^\circ)(T_n - \theta)$, where $T_n^\circ$ lies in the interval
between $T_n$ and $\theta$. So $\sqrt{n}\,[U_n - h(\theta)] = h'(T_n^\circ)\sqrt{n}\,(T_n - \theta)$. Because $T_n$
converges in probability to $\theta$, so does $T_n^\circ$, and hence by continuity
$h'(T_n^\circ)$ converges in probability to $h'(\theta)$. Then it is not surprising that
asymptotically, $U_n$ behaves like $U_n^*$.


9.6. Asymptotics of Some Sample Statistics 

We apply the theory to two sample statistics. 
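Sample Variance. Recall that

(9.1) $M_2 = M_2^* - (\bar{X} - \mu)^2 = h(M_2^*, \bar{X}),$


say, where $M_2 = S^2 = \sum_i (X_i - \bar{X})^2/n$ is the sample variance, and
$M_2^* = \sum_i (X_i - \mu)^2/n$ is the ideal sample variance. By the LLN,


$$M_2^* \xrightarrow{P} \mu_2, \qquad \bar{X} \xrightarrow{P} \mu.$$

Hence, by S2,

$$M_2 = h(M_2^*, \bar{X}) \xrightarrow{P} h(\mu_2, \mu) = \mu_2 - (\mu - \mu)^2 = \mu_2.$$


That is, the probability limit of $S^2$ is the same as that of $M_2^*$, namely $\sigma^2$.
Next, rewrite Eq. (9.1) as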


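(9.2) $\sqrt{n}\,(M_2 - \mu_2) = \sqrt{n}\,(M_2^* - \mu_2) - U^2,$

where $U = n^{1/4}(\bar{X} - \mu)$. By the CLT,

$$\sqrt{n}\,(M_2^* - \mu_2) \xrightarrow{D} N(0, \mu_4 - \mu_2^2).$$


By linear function rules, $E(U) = n^{1/4}E(\bar{X} - \mu) = 0$, and $V(U) =
\sqrt{n}\,V(\bar{X}) = \sigma^2/\sqrt{n}$. Because $\lim E(U) = 0$ and $\lim V(U) = 0$, $U$ converges
in mean square to 0 by C1, and hence $U \xrightarrow{P} 0$ by C2. So $U^2 \xrightarrow{P} 0$ by S1,
whence by S3 we conclude that


$$\sqrt{n}\,(M_2 - \mu_2) \xrightarrow{D} N(0, \mu_4 - \mu_2^2).$$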


That is, the limiting distribution of $\sqrt{n}\,(S^2 - \sigma^2)$ is the same as that of
$\sqrt{n}\,(M_2^* - \sigma^2)$, namely $N(0, \mu_4 - \mu_2^2)$. Equivalently, the asymptotic
distribution of $S^2$ is the same as that of $M_2^*$, namely $N[\sigma^2, (\mu_4 - \mu_2^2)/n]$.

Sample t-ratio. Let $U = \sqrt{n}\,(\bar{X} - \mu)/S$ and $Z = \sqrt{n}\,(\bar{X} - \mu)/\sigma$, so $U =
(\sigma/S)Z$. Now $Z \xrightarrow{D} N(0, 1)$ by the CLT, and $\sigma/S \xrightarrow{P} 1$ (using $S^2 \xrightarrow{P} \sigma^2$, and
S1). So by S4, $U \xrightarrow{D} N(0, 1)$. That is, the limiting distribution of $U$ is the
same as that of $Z$, namely $N(0, 1)$. The same conclusion will follow if
one uses $\sqrt{n - 1}$ instead of $\sqrt{n}$ in defining the sample t-ratio, as is
sometimes done.
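
Because this result does not require a normal population, it can be illustrated with a skewed one. The sketch below draws samples of size 200 from the exponential(1) population, forms the t-ratio $U = \sqrt{n}\,(\bar{X} - \mu)/S$ with the known $\mu = 1$, and records how often $|U| \le 1.96$. Under the $N(0, 1)$ limit that share should be near 0.95. The sample size, replication count, seed, and function names are assumptions for the example.

```python
import math
import random

def t_ratio(xs, mu):
    """Sample t-ratio sqrt(n) * (xbar - mu) / S, with S^2 using divisor n."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n
    return math.sqrt(n) * (xbar - mu) / math.sqrt(s2)

def share_within_limits(n=200, reps=3000, limit=1.96, seed=4):
    """Fraction of simulated t-ratios with |U| <= limit, exponential(1) population."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.expovariate(1.0) for _ in range(n)]
        if abs(t_ratio(xs, 1.0)) <= limit:
            hits += 1
    return hits / reps

share = share_within_limits()
print(share)
```

For much smaller $n$ the skewness of the population would show up as a share visibly below 0.95, a reminder that the limit is an approximation.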


Exercises 


9.1 Discuss Exercise 8.4 in the light of the asymptotic theory of this 
chapter. 


9.2 In random sampling, sample size $n = 20$, on the variable $X$ where
$X \sim$ exponential(2), let $T = \sum_i X_i^2/n$. In Exercise 8.6, you found $E(T)$
and $V(T)$. Use the CLT to approximate the probability that $T$ is less
than or equal to 1.


9.3 As in Exercise 8.7, let $\bar{X}$ denote the sample mean in random
sampling, sample size $n$, from a population in which the random variable
$X \sim$ exponential($\lambda$). For convenience, let $\theta = E(X) = 1/\lambda$. So $E(\bar{X}) = \theta$,
$V(\bar{X}) = \theta^2/n$, plim $\bar{X} = \theta$, and the limiting distribution of $\sqrt{n}\,(\bar{X} - \theta)$ is
$N(0, \theta^2)$. Consider the sample statistic $U = 1/\bar{X}$.


(a) Use a Slutsky theorem to show that plim $U = \lambda$.

(b) Use the Delta method to find the limiting distribution of
$\sqrt{n}\,(U - \lambda)$.

(c) Use your result to approximate $\Pr(U \le 5/2)$ in random sampling,
sample size 16, from an exponential population with $\lambda = 2$.

(d) Find the exact $\Pr(U \le 5/2)$.


9.4 The probability distribution of the random variable $X$ is given by
$\Pr(X = 1) = 1/3$, $\Pr(X = 2) = 2/3$. In random sampling, sample size $n$,
let $T = \sum_{i=1}^n X_i^3/n$. For $n = 98$, approximate $\Pr(5 \le T \le 6)$.


9.5 In a population, the random variable $X$ = length of unemployment
(in months) has the exponential distribution with parameter $\lambda = 2$.
Consider random sampling, sample size $n = 21$. Let $T$ = proportion of
the sampled persons who have been unemployed between 0.4158 and
1 months. Approximate the probability that $T$ lies between 0.4 and 0.5.
Hint: Define the random variable


$Y = 1$ if $0.4158 \le X \le 1$,
$Y = 0$ otherwise.


9.6 In a certain population the random variable Y has variance equal 
to 490. Two independent random samples, each of size 20, are drawn. 
The first sample mean is used as the predictor of the second sample 
mean. 


(a) Calculate the expectation, expected square, and variance, of the 
prediction error. 

(b) Approximate the probability that the prediction error is less than 
14 in absolute value. 


10 Sampling Distributions: Bivariate Case


10.1. Introduction 


Having acquired considerable information about the distributions of
sample statistics in random sampling from a univariate population, we
proceed to bivariate populations. Consider a bivariate population in
which the pmf or pdf of $(X, Y)$ is $f(x, y)$. The first and second moments
include


$$E(X) = \mu_X, \qquad E(Y) = \mu_Y,$$
$$V(X) = \sigma_X^2, \qquad V(Y) = \sigma_Y^2, \qquad C(X, Y) = \sigma_{XY}.$$

And for nonnegative integers $(r, s)$, the raw and central moments are

$$E(X^r Y^s) = \mu_{rs}', \qquad E(X^{*r} Y^{*s}) = \mu_{rs},$$


where $X^* = X - \mu_X$, $Y^* = Y - \mu_Y$. For example, $\mu_{20} = \sigma_X^2$ (formerly
called $\mu_2$), $\mu_{02} = \sigma_Y^2$, and $\mu_{11} = \sigma_{XY}$.

A random sample of size $n$ from this population consists of $n$ independent
drawings: the random vectors $(X_i, Y_i)$ for $i = 1, \ldots, n$ are
independently and identically distributed. Of course $X_i$ and $Y_i$ need not
be independent; indeed $C(X_i, Y_i) = \sigma_{XY}$. Independence runs across the
$n$ observations, not within each observation. The joint pmf or pdf of the
random sample is


$$g_n(x_1, y_1, x_2, y_2, \ldots, x_n, y_n) = \prod_i f(x_i, y_i).$$


Sample statistics are functions of the random sample. They include not
only the single-variable statistics $\bar{X}$, $\bar{Y}$, $S_X^2$, $S_Y^2$, but also the joint statistics
that involve both components of the random vector $(X, Y)$. A leading
example is the sample covariance

$$S_{XY} = (1/n)\sum_i (X_i - \bar{X})(Y_i - \bar{Y}).$$


We may also be concerned with the joint distribution of several statistics.
For example, the two sample means


$$\bar{X} = (1/n)\sum_i X_i, \qquad \bar{Y} = (1/n)\sum_i Y_i,$$


are a pair of random variables that have a joint sampling distribution.
We already have their expectations and variances. We can calculate their
covariance by T5, the linear function rule (Section 5.1), extended to $n$
variables:


$$C(\bar{X}, \bar{Y}) = (1/n^2)\sum_i \sum_h C(X_i, Y_h) = (1/n^2)\sum_i C(X_i, Y_i) = (1/n^2)\,n\,\sigma_{XY} = \sigma_{XY}/n,$$


using the independence and identically-distributed features of random
sampling. This result on the covariance of two sample means is quite
analogous to the result on the variance of a single sample mean, $V(\bar{X}) =
\sigma_X^2/n$.

All previous conclusions for the univariate case apply to the single- 
variable statistics, but we have additional conclusions as well. We confine 
attention to general results that are applicable regardless of the form 
of the population. 


10.2. Sample Covariance 
The theory for the sampling distribution of the sample covariance,

$$S_{XY} = (1/n)\sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = M_{11},$$


say, runs quite parallel to that for the sample variance.


Ideal Sample Covariance 


Consider first the ideal sample covariance, namely the sample second joint
moment about the population means:


$$M_{11}^* = (1/n)\sum_i (X_i - \mu_X)(Y_i - \mu_Y) = (1/n)\sum_i V_i = \bar{V},$$

say, where $V = X^*Y^*$, so $V_i = X_i^*Y_i^*$. Now $V_1, \ldots, V_n$ is a random sample
on the variable $V = X^*Y^*$, so $M_{11}^* = \bar{V}$ is a sample mean in random
sampling on the variable $V$. Hence the earlier theory for sample means
applies.


The population mean and variance of $V$ are

$$E(V) = E(X^*Y^*) = C(X, Y) = \sigma_{XY} = \mu_{11},$$
$$V(V) = E(V^2) - E^2(V) = E(X^{*2}Y^{*2}) - E^2(X^*Y^*) = \mu_{22} - \mu_{11}^2.$$

So we have the exact results

$$E(M_{11}^*) = E(\bar{V}) = E(V) = \mu_{11} = \sigma_{XY},$$
$$V(M_{11}^*) = V(\bar{V}) = V(V)/n = (\mu_{22} - \mu_{11}^2)/n,$$

and also the asymptotic results

$$M_{11}^* \xrightarrow{P} \mu_{11},$$
$$\sqrt{n}\,(M_{11}^* - \mu_{11}) \xrightarrow{D} N(0, \mu_{22} - \mu_{11}^2),$$
$$M_{11}^* \overset{A}{\sim} N[\mu_{11}, (\mu_{22} - \mu_{11}^2)/n].$$


Sample Covariance 


Now turn to the sample covariance itself, namely the sample second
joint moment about the sample means. As a matter of algebra,


(10.1) $S_{XY} = M_{11} = M_{11}^* - (\bar{X} - \mu_X)(\bar{Y} - \mu_Y).$

So

$$E(S_{XY}) = \sigma_{XY} - C(\bar{X}, \bar{Y}) = \sigma_{XY} - \sigma_{XY}/n = (1 - 1/n)\sigma_{XY},$$

which is quite parallel to $E(S^2) = (1 - 1/n)\sigma^2$. And direct calculation
gives
$$V(S_{XY}) = (n - 1)^2(\mu_{22} - \mu_{11}^2)/n^3 + (n - 1)(\mu_{11}^2 + \mu_{20}\mu_{02})/n^3,$$


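which is quite parallel to the exact result for $V(S^2)$ in Section 8.4. As for
asymptotics, we obtain


$$S_{XY} \xrightarrow{P} \sigma_{XY}, \qquad \sqrt{n}\,(S_{XY} - \sigma_{XY}) \xrightarrow{D} N(0, \mu_{22} - \mu_{11}^2).$$

Proof. First, in Eq. (10.1), $M_{11}^* \xrightarrow{P} \sigma_{XY}$, $(\bar{X} - \mu_X) \xrightarrow{P} 0$, and $(\bar{Y} - \mu_Y)
\xrightarrow{P} 0$, so $S_{XY}$ converges in probability to $\sigma_{XY}$, using Slutsky Theorem S2
(Section 9.5) twice. Next, rewrite Eq. (10.1) as

(10.2) $\sqrt{n}\,(M_{11} - \mu_{11}) = \sqrt{n}\,(M_{11}^* - \mu_{11}) - UW,$

where

$$U = n^{1/4}(\bar{X} - \mu_X), \qquad W = n^{1/4}(\bar{Y} - \mu_Y).$$


Since $U \xrightarrow{P} 0$ and $W \xrightarrow{P} 0$, we have $(UW) \xrightarrow{P} 0$ by S2. So by S3 the
limiting distribution of $\sqrt{n}\,(S_{XY} - \sigma_{XY})$ is the same as the limiting
distribution of $\sqrt{n}\,(M_{11}^* - \mu_{11})$. ∎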


We conclude that the asymptotic distribution of the sample covariance
is the same as that of the ideal sample covariance:


$$S_{XY} \overset{A}{\sim} N[\sigma_{XY}, (\mu_{22} - \mu_{11}^2)/n],$$


which is quite parallel to the asymptotic result for the sample variance
in Section 9.6.


10.3. Pair of Sample Means 


Now turn to the joint distribution of the pair of sample means, $\bar{X}$ and
$\bar{Y}$. We proceed to asymptotics. For a random vector, convergence in
probability means that each component of the vector converges in probability,
and convergence in distribution means that the sequence of joint
cdf's has as its limit some fixed joint cdf. For convenience we drop the
subscript "n" that identifies a sequence. The key theorems are:


BIVARIATE LAW OF LARGE NUMBERS. In random sampling
from any bivariate population, the sample mean vector $(\bar{X}, \bar{Y})$ converges
in probability to the population mean vector $(\mu_X, \mu_Y)$.


BIVARIATE CENTRAL LIMIT THEOREM. In random sampling
from any bivariate population, the standardized sample mean
vector, $[\sqrt{n}\,(\bar{X} - \mu_X)/\sigma_X, \sqrt{n}\,(\bar{Y} - \mu_Y)/\sigma_Y]$, converges in distribution to
the SBVN($\rho$) distribution, where $\rho = \sigma_{XY}/(\sigma_X\sigma_Y)$. Equivalently, in
random sampling from any bivariate population, $(\bar{X}, \bar{Y}) \overset{A}{\sim} \text{BVN}(\mu_X, \mu_Y,
\sigma_X^2/n, \sigma_Y^2/n, \sigma_{XY}/n)$.


These theorems apply directly, of course, to any pair of sample
moments that can be interpreted as a pair of sample means in random
sampling. But again we need tools for deriving asymptotics for functions
of sample means. The Slutsky Theorems S1–S4 extend in the obvious
way, while S5 generalizes to the


BIVARIATE DELTA METHOD. If $(T_1, T_2) \overset{A}{\sim} \text{BVN}(\theta_1, \theta_2, \phi_1^2/n,
\phi_2^2/n, \phi_{12}/n)$ and $U = h(T_1, T_2)$ is continuously differentiable at the point
$(\theta_1, \theta_2)$, then $U \overset{A}{\sim} N[h(\theta_1, \theta_2), \phi^2/n]$, where


$$\phi^2 = h_1^2\phi_1^2 + h_2^2\phi_2^2 + 2h_1h_2\phi_{12},$$
$h_1 = h_1(\theta_1, \theta_2) = \partial h(T_1, T_2)/\partial T_1$ evaluated at $T_1 = \theta_1$, $T_2 = \theta_2$,
$h_2 = h_2(\theta_1, \theta_2) = \partial h(T_1, T_2)/\partial T_2$ evaluated at $T_1 = \theta_1$, $T_2 = \theta_2$.


In other words, the asymptotic distribution of $U = h(T_1, T_2)$ is the same
as that of its linear approximation at the point $(\theta_1, \theta_2)$, namely


$$U^* = h(\theta_1, \theta_2) + h_1(\theta_1, \theta_2)(T_1 - \theta_1) + h_2(\theta_1, \theta_2)(T_2 - \theta_2).$$


The understanding is that the function $h(\cdot, \cdot)$ does not involve $n$.


10.4. Ratio of Sample Means 


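To illustrate the application of this theory, consider the ratio of sample
means, $T = \bar{X}/\bar{Y}$, with the proviso that $\mu_Y \ne 0$. The asymptotic joint
distribution of $\bar{X}$ and $\bar{Y}$ is given by the Bivariate CLT, and the analysis
for $T$ starts with


$$T = \bar{X}/\bar{Y} = h(\bar{X}, \bar{Y}), \qquad h(\mu_X, \mu_Y) = \mu_X/\mu_Y = \theta,$$

say, where $h(\cdot, \cdot)$ is the ratio function. We recognize that

$$E(T) = E(\bar{X}/\bar{Y}) \ne E(\bar{X})/E(\bar{Y}) = \mu_X/\mu_Y = \theta.$$


But $\bar{X} \xrightarrow{P} \mu_X$ and $\bar{Y} \xrightarrow{P} \mu_Y$, by the LLN (Section 9.3), so that $T \xrightarrow{P} \theta$
by S2.
Proceeding, we calculate


$$h_1(\bar{X}, \bar{Y}) = \partial h/\partial\bar{X} = 1/\bar{Y}, \qquad h_1(\mu_X, \mu_Y) = 1/\mu_Y,$$
$$h_2(\bar{X}, \bar{Y}) = \partial h/\partial\bar{Y} = -\bar{X}/\bar{Y}^2, \qquad h_2(\mu_X, \mu_Y) = -\mu_X/\mu_Y^2 = -\theta/\mu_Y.$$

So the Bivariate Delta method gives


$$T \overset{A}{\sim} N(\theta, \phi^2/n),$$

where $\phi^2/n$ is the (asymptotic) variance of the linear approximation to
$T$ at the point $(\mu_X, \mu_Y)$, namely


$$T^* = \theta + (1/\mu_Y)(\bar{X} - \mu_X) - (\theta/\mu_Y)(\bar{Y} - \mu_Y) = \theta + (1/\mu_Y)[(\bar{X} - \mu_X) - \theta(\bar{Y} - \mu_Y)].$$


That is,

(10.3) $\phi^2 = (1/\mu_Y^2)(\sigma_X^2 + \theta^2\sigma_Y^2 - 2\theta\sigma_{XY}).$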
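
As a check on the algebra, Eq. (10.3) can be compared with the general formula $\phi^2 = h_1^2\phi_1^2 + h_2^2\phi_2^2 + 2h_1h_2\phi_{12}$ of Section 10.3. The short sketch below does this for one set of made-up population moments; the numerical values and function names are assumptions chosen only for illustration.

```python
def bivariate_delta_phi2(h1, h2, phi11, phi22, phi12):
    """General Bivariate Delta formula for phi^2 given the two partial derivatives."""
    return h1 ** 2 * phi11 + h2 ** 2 * phi22 + 2.0 * h1 * h2 * phi12

def ratio_phi2(mu_x, mu_y, var_x, var_y, cov_xy):
    """Closed form of Eq. (10.3) for the ratio of sample means."""
    theta = mu_x / mu_y
    return (var_x + theta ** 2 * var_y - 2.0 * theta * cov_xy) / mu_y ** 2

# Hypothetical population moments.
mu_x, mu_y = 3.0, 2.0
var_x, var_y, cov_xy = 4.0, 1.0, 0.5

h1 = 1.0 / mu_y            # partial derivative of x/y with respect to x
h2 = -mu_x / mu_y ** 2     # partial derivative of x/y with respect to y
general = bivariate_delta_phi2(h1, h2, var_x, var_y, cov_xy)
closed = ratio_phi2(mu_x, mu_y, var_x, var_y, cov_xy)
print(general, closed)
```

Both routes give 1.1875 for these values, so the asymptotic variance of $T$ in a sample of size $n$ would be $1.1875/n$.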


10.5. Sample Slope 


In Section 5.4, we introduced the population linear projection (BLP) of
$Y$ on $X$ in a bivariate distribution, namely the line $E^*(Y|X) = \alpha + \beta X$,
with

$$\beta = \sigma_{XY}/\sigma_X^2, \qquad \alpha = \mu_Y - \beta\mu_X.$$


The corresponding feature in a sample is the sample linear projection (or
sample LP) of $Y$ on $X$, namely the line $\hat{Y} = A + BX$, with


$$B = S_{XY}/S_X^2 = M_{11}/M_{20}, \qquad A = \bar{Y} - B\bar{X}.$$


To further illustrate application of our theory, we seek the asymptotic
distribution of the statistic $B$, the sample slope, in random sampling from
any bivariate population.


Ideal Sample Slope 


We first treat a simpler statistic, the ratio of the corresponding sample
moments taken about the population means, $B^* = M_{11}^*/M_{20}^*$, which we
refer to as the ideal sample slope. Now


$$M_{11}^* = (1/n)\sum_i X_i^*Y_i^* = \bar{V}, \qquad M_{20}^* = (1/n)\sum_i X_i^{*2} = \bar{W},$$

say, where $V = X^*Y^*$, $W = X^{*2}$, $X^* = X - \mu_X$, $Y^* = Y - \mu_Y$. So

$$B^* = M_{11}^*/M_{20}^* = \bar{V}/\bar{W}$$


is a ratio of sample means in random sampling on $(V, W)$, and the
machinery of Section 10.4 applies directly.

Because $\mu_V = E(X^*Y^*) = \sigma_{XY}$ and $\mu_W = E(X^{*2}) = \sigma_X^2$, we have
$\mu_V/\mu_W = \beta$, so

$$B^* \xrightarrow{P} \sigma_{XY}/\sigma_X^2 = \beta.$$

And similarly, we have

$$B^* \overset{A}{\sim} N(\beta, \phi^2/n),$$

with

(10.4) $\phi^2 = (1/\mu_W^2)[\sigma_V^2 + \beta^2\sigma_W^2 - 2\beta\sigma_{VW}].$


It remains to express the variances and covariance of $V$ and $W$ in terms
of the parameters of the bivariate distribution of $X$ and $Y$. Calculate


$$\sigma_V^2 = V(V) = E(V^2) - E^2(V) = E(X^{*2}Y^{*2}) - E^2(X^*Y^*) = \mu_{22} - \mu_{11}^2,$$
$$\sigma_W^2 = V(W) = E(W^2) - E^2(W) = E(X^{*4}) - E^2(X^{*2}) = \mu_{40} - \mu_{20}^2,$$
$$\sigma_{VW} = C(V, W) = E(VW) - E(V)E(W) = E(X^{*3}Y^*) - E(X^*Y^*)E(X^{*2}) = \mu_{31} - \mu_{11}\mu_{20}.$$

So in Eq. (10.4) the term in square brackets can be written as

$$\mu_{22} - \mu_{11}^2 + \beta^2(\mu_{40} - \mu_{20}^2) - 2\beta(\mu_{31} - \mu_{11}\mu_{20}) = \mu_{22} + \beta^2\mu_{40} - 2\beta\mu_{31},$$

using $\beta\mu_{20} = \mu_{11}$. Thus


(10.5) $\phi^2 = (\mu_{22} + \beta^2\mu_{40} - 2\beta\mu_{31})/\mu_{20}^2.$


Sample Slope 


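Now return to $B = M_{11}/M_{20}$, the sample slope itself. We know that
$M_{11} \xrightarrow{P} \mu_{11}$ and $M_{20} \xrightarrow{P} \mu_{20}$, so it follows immediately by S2 that


$$B \xrightarrow{P} \mu_{11}/\mu_{20} = \beta.$$

Next write


$$B - \beta = (B^* - \beta) + (B - B^*).$$

As a matter of algebra,

$$B - B^* = (1/M_{20}^*)[(M_{11} - M_{11}^*) - (M_{11}/M_{20})(M_{20} - M_{20}^*)],$$

so

$$\sqrt{n}\,(B - B^*) = (1/M_{20}^*)[\sqrt{n}\,(M_{11} - M_{11}^*) - (M_{11}/M_{20})\sqrt{n}\,(M_{20} - M_{20}^*)].$$

Since $M_{20}^* \xrightarrow{P} \mu_{20}$, $(M_{11}/M_{20}) \xrightarrow{P} \mu_{11}/\mu_{20}$, $\sqrt{n}\,(M_{11} - M_{11}^*) \xrightarrow{P} 0$, and
$\sqrt{n}\,(M_{20} - M_{20}^*) \xrightarrow{P} 0$, it follows that $\sqrt{n}\,(B - B^*) \xrightarrow{P} 0$. So by S3 the
limiting distribution of $\sqrt{n}\,(B - \beta)$ is the same as that of $\sqrt{n}\,(B^* - \beta)$.
Equivalently, the asymptotic distribution of $B$ is the same as that of $B^*$.
We conclude that


$$B \overset{A}{\sim} N(\beta, \phi^2/n),$$


with $\phi^2$ as given in Eq. (10.5).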


10.6. Variance of Sample Slope 
Because the sample slope is very commonly used to measure the sample relation between two variables, we should learn more about its sampling distribution. In Eq. (10.5), the denominator of $\phi^2$ is $\mu_{20}^2 = V^2(X)$, while the numerator can be written as

$$\mu_{22} + \beta^2\mu_{40} - 2\beta\mu_{31} = E(X^{*2}Y^{*2}) + \beta^2E(X^{*4}) - 2\beta E(X^{*3}Y^*) = E[X^{*2}(Y^* - \beta X^*)^2] = E(X^{*2}U^2),$$

say, where

$$U = Y^* - \beta X^* = (Y - \mu_Y) - \beta(X - \mu_X) = Y - (\alpha + \beta X)$$

is the deviation from the population BLP. So

$$\phi^2 = E(X^{*2}U^2)/V^2(X). \tag{10.6}$$

A special case arises if $E(U^2|X)$ is constant over $X$. Let $\sigma^2$ denote that constant value of $E(U^2|X)$, and calculate

$$E(X^{*2}U^2) = E_X[E(X^{*2}U^2|X)] = E_X[X^{*2}E(U^2|X)] = E(X^{*2}\sigma^2) = \sigma^2E(X^{*2}) = \sigma^2V(X).$$

So in this special case, $\phi^2 = \sigma^2/\sigma_X^2$, which may look familiar to those who have studied "linear regression analysis."

Perhaps the only situation that ensures the constancy of $E(U^2|X)$ is a population in which the conditional expectation function $E(Y|X)$ is linear in $X$ (so that $U = Y - E(Y|X)$, with $E(U^2|X) = V(Y|X)$), and the conditional variance function $V(Y|X)$ is constant over $X$. (We know one population, the bivariate normal, that has those features.) In this situation we can say that, for given sample size $n$, the asymptotic variance of the sample slope will be large if the conditional variance of $Y$ is large and/or the marginal variance of $X$ is small.
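
To see how Eq. (10.6) and its special case might be put to work, here is a minimal sketch that replaces each population moment by its sample analog. The function name `slope_and_phi2` and the four data points are hypothetical, chosen only for illustration; the plug-in step is an assumption in the spirit of the analogy principle, not a result established above.

```python
def slope_and_phi2(xs, ys):
    """Sample slope B with two plug-in estimates of phi^2.

    general: sample analog of E(X*^2 U^2) / V(X)^2   (Eq. 10.6)
    special: sample analog of sigma^2 / sigma_X^2    (constant E(U^2|X))
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    xd = [x - mean_x for x in xs]          # deviations of x from its mean
    yd = [y - mean_y for y in ys]          # deviations of y from its mean
    m20 = sum(a * a for a in xd) / n       # sample variance of x (divisor n)
    m11 = sum(a * b for a, b in zip(xd, yd)) / n   # sample covariance
    b = m11 / m20                          # sample slope
    u = [d - b * a for a, d in zip(xd, yd)]  # deviations from the sample line
    phi2_general = sum((a * e) ** 2 for a, e in zip(xd, u)) / n / m20 ** 2
    phi2_special = (sum(e * e for e in u) / n) / m20
    return b, phi2_general, phi2_special


# Hypothetical data, for illustration only.
b, phi2_general, phi2_special = slope_and_phi2([1, 2, 3, 4], [2, 4, 5, 9])
print(b, phi2_general, phi2_special)
```

Dividing either estimate of $\phi^2$ by $n$ and taking the square root gives a corresponding approximate standard deviation for the sample slope. The two estimates differ in this small data set because the squared deviations are not unrelated to $X^{*2}$ there.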


Exercises 


10.1 Given a data set with $n$ paired observations $(x_i, y_i)$, let $e_i = y_i - (a + bx_i)$, where $a$ and $b$ are constants to be chosen to minimize $\sum_i e_i^2$.


(a) Show that the solution values are

$$b = \Big[\sum_i (x_i - \bar{x})(y_i - \bar{y})\Big] \Big/ \Big[\sum_i (x_i - \bar{x})^2\Big], \qquad a = \bar{y} - b\bar{x}.$$


(b) Referring to Section 10.5, show that the sample linear projection $\hat{y} = A + Bx$ has this least-squares property.


10.2 These statistics were obtained in a sample of 30 observations from a bivariate population $f(x, y)$: $\sum_i x_i = 30$, $\sum_i y_i = 120$, $\sum_i x_i^2 = 150$, $\sum_i y_i^2 = 1830$, $\sum_i x_i y_i = 480$. Here $\sum_i$ denotes summation over $i$ from 1 to 30.


(a) Calculate the sample means, sample variances, and sample covariance.
(b) Find $\hat{y} = a + bx$, the sample linear projection of $y$ on $x$.


10.3 For the savings rate-income data of Chapter 1:

(a) Calculate the sample linear projection of $y$ = savings rate on $x$ = income. Hint: The sample covariance can be calculated as

$$S_{XY} = \sum_{i=1}^{10} [x_i\, m_{Y|x_i}\, p(x_i)] - m_X m_Y.$$

(b) Comment on the relation between this line and the conditional mean function that was plotted in Figure 1.1.


10.4 A bivariate population $f(x, y)$ has these moments:


$E(X) = 1$            $E(Y) = 4$             $E(X^*Y^*) = 360$
$E(X^{*2}) = 120$     $E(Y^{*2}) = 1350$     $E(X^*Y^{*2}) = 0$
$E(X^{*3}) = 10$      $E(Y^{*3}) = -40$      $E(X^{*3}Y^*) = 320$
$E(X^{*4}) = 14880$   $E(Y^{*4}) = 12000$    $E(X^{*2}Y^{*2}) = 1596000$

Here $X^* = X - E(X)$, $Y^* = Y - E(Y)$.


(a) Find the BLP, $E^*(Y|X) = \alpha + \beta X$.

(b) Consider random sampling, sample size 30, from this population. Let $\bar{X}$ be the sample mean of $X$, $S_X^2$ be the sample variance of $X$, and $B$ be the slope in the sample linear projection of $Y$ on $X$. Approximate the probability of each of these events:


$$A_1 = \{\bar{X} \le 3/2\}, \qquad A_2 = \{S_X^2 \le 198\}, \qquad A_3 = \{B \le 5\}.$$


11 Parameter Estimation 


11.1. Introduction 


We have accumulated considerable information about the probability 
distributions of sample statistics in random sampling from univariate 
and bivariate populations. In some cases we have the complete exact 
sampling distribution, in some cases only its expectation and variance. 
In many cases, we have the asymptotic sampling distribution. The infor- 
mation was obtained by deducing features of the distributions of func- 
tions of random variables (the sample statistics) from knowledge of the 
distribution of the original variables (in the population). 

The practical problem is, of course, quite the opposite: it calls for 
inferring (or guessing, or estimating) features of the population from 
knowledge of a single sample. This problem is not trivial, because the 
same sample might arise in sampling from many different populations. 
We are turning from deduction to inference. 

We will have a single random sample $(y_1, \ldots, y_n)'$ drawn from an unknown population $Y \sim f(y)$. We are interested in some feature of the population, a parameter $\theta$, say. Our task is to find an estimate of $\theta$, a single number that will serve as our guess of the value of the parameter. Naturally the estimate will be a function of the sample data. How shall we process the sample data to come up with an estimate? That is, what function $h(y_1, \ldots, y_n)$ shall we choose as our estimate?

Now the sample $y = (y_1, \ldots, y_n)'$ is a single observation on the random vector $Y = (Y_1, \ldots, Y_n)'$. For a function $h(y)$, the estimate that we calculate, $t = h(y_1, \ldots, y_n)$, will be a single observation on the random variable $T = h(Y)$. The random variable $T$ is referred to as the estimator, as distinguished from the value $t$ that it happens to take on, which is the estimate.


11.2. The Analogy Principle


Perhaps the most natural rule for selecting an estimator is the analogy 
principle. A population parameter is a feature of the population. To 
estimate it, use the corresponding feature of the sample. 

Here are some applications of this principle, that is, examples of analog estimators.

* To estimate a population moment, use the corresponding sample moment. For $\theta = \mu$, use $T = \bar{Y}$. For $\theta = \sigma^2$, use $T = S^2$.

* To estimate a function of population moments, use the function of the sample moments. For the population BLP slope $\beta = \sigma_{XY}/\sigma_X^2$, use $B = S_{XY}/S_X^2$. For the population BLP intercept $\alpha = \mu_Y - \beta\mu_X$, use $A = \bar{Y} - B\bar{X}$. For $\sigma^2/n$, use $S^2/n$. This sort of application of the analogy principle is also referred to as the method of moments.

* To estimate $\Pr(Y \le c)$, use the sample proportion of observations that have $Y \le c$.

* To estimate the population median, use the sample median.

* To estimate the population maximum, use the sample maximum.

* To estimate the population BLP $E^*(Y|X) = \alpha + \beta X$ (the line that minimizes expected squared deviations in the population), use the sample least-squares line $\hat{Y} = A + BX$ (the line that minimizes the mean of squared deviations in the sample). As shown in Exercise 10.1, this gives the same answers for $A$ and $B$ as above.

* To estimate a population CEF $\mu_{Y|X}$, use the sample conditional mean function $m_{Y|X}$, discussed in Chapter 1.
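
Several of the analog estimators listed above can be computed in a few lines. The following is a minimal sketch, not a definitive implementation; the function name `analog_estimates`, the dictionary keys, and the data are hypothetical. Sample variances and covariances use the divisor $n$, so they are the sample moments themselves.

```python
from statistics import median


def analog_estimates(xs, ys, c):
    """Analog estimators from paired data (xs, ys); c is a cutoff for Pr(Y <= c)."""
    n = len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    slope = cov_xy / var_x                  # B = S_XY / S_X^2
    intercept = mean_y - slope * mean_x     # A = Ybar - B * Xbar
    return {
        "mean_y": mean_y,
        "var_y": var_y,
        "slope": slope,
        "intercept": intercept,
        "prop_le_c": sum(1 for y in ys if y <= c) / n,  # sample proportion
        "median_y": median(ys),
        "max_y": max(ys),
    }


# Hypothetical data, for illustration only.
print(analog_estimates([1, 2, 3, 4], [2, 4, 5, 9], c=4))
```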

The analogy principle is constructive as well as natural. Once we 
decide which feature of the population is of interest to us, we will almost 
inevitably recognize an estimator for it. Adopting the analogy principle 
as the starting point in a search for estimators leads immediately to 
some questions. Are analog estimators sensible from a statistical point 
of view? How reliable are they? What shall we do when an analog 
estimator is unreliable, or inadequate? What shall we do when there are 
several analog estimators of the same parameter? (For example, if the 
population f(x) is symmetric, then the population mean and median 
coincide, but the sample mean and median are distinct.) For a compre- 
hensive development of the theory, see Manski (1988). 

To address such questions here, we turn to the classical criteria for evaluating estimators. We will begin with criteria that refer to exact sampling distributions.


11.3. Criteria for an Estimator


Let $T = h(Y_1, \ldots, Y_n)$ be a sample statistic with a pdf (or pmf) $g(t)$, and moments $E(T)$, $V(T)$, etc. We may be sampling from a univariate population (with each $Y_i$ a scalar) or from a bivariate population (with each $Y_i$ a two-element vector).

Choosing an estimator $T$ amounts to choosing a sampling distribution from which to get a single draw. So the issue becomes: what probability distribution would we like to draw from? When we choose an estimator, we are buying a lottery ticket; in which lottery would we prefer to participate? Presume that the prize in the lottery is higher the closer $T$ is to $\theta$. We would like $T = \theta$, but that ideal is unattainable. The only sample statistics with degenerate distributions are the trivial ones, those for which $h(Y)$ = constant. It is easy enough to pick such a function, for example, $h(Y) = 3$, but that is hardly an attractive estimator unless $\theta = 3$. What we ask is that $T$ be close to $\theta$, whatever the value of $\theta$ might be.

A natural measure of distance between the random variable $T$ and the parameter $\theta$ is the mean squared error (MSE), $E(T - \theta)^2$. We used the MSE measure in discussing prediction of a random variable in Section 3.4. Now the target is a fixed parameter rather than a random variable, but again it seems desirable to have a small value for $E(T - \theta)^2$. According to T3 (Section 3.3), the MSE of $T$ about $\theta$ can be written as

$$E(T - \theta)^2 = V(T) + [E(T) - \theta]^2.$$

Define the bias of $T$ as an estimator of $\theta$ as $E(T) - \theta = E(T - \theta)$. So the MSE of $T$ as an estimator of $\theta$ equals the variance of $T$ plus the squared bias of $T$ as an estimator of $\theta$. In general, both variance and bias depend on the unknown parameter $\theta$, and it is not feasible to find a $T$ that minimizes $E(T - \theta)^2$ for all $\theta$. Still, small MSE is desirable.
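
The decomposition of MSE into variance plus squared bias can be checked exactly for any estimator whose sampling distribution is a known pmf. The sketch below uses exact rational arithmetic; the function name `mse_decomposition`, the three-point pmf, and the value of $\theta$ are hypothetical, chosen only to illustrate the identity.

```python
from fractions import Fraction


def mse_decomposition(pmf, theta):
    """Return (MSE, variance, bias) of T about theta, where pmf maps t -> Pr(T = t)."""
    mean_t = sum(p * t for t, p in pmf.items())
    variance = sum(p * (t - mean_t) ** 2 for t, p in pmf.items())
    bias = mean_t - theta
    mse = sum(p * (t - theta) ** 2 for t, p in pmf.items())
    return mse, variance, bias


# Hypothetical sampling distribution of T and hypothetical parameter value.
pmf_t = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
mse, variance, bias = mse_decomposition(pmf_t, Fraction(3, 2))
print(mse, variance, bias, mse == variance + bias ** 2)
```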

It also seems desirable to have an estimator for which the expected 
deviation from the parameter is zero: 


DEFINITION. $T$ is an unbiased estimator of $\theta$ iff $E(T - \theta) = 0$, for all $\theta$.


For unbiased estimators, MSE = variance, which leads to a popular 
criterion, namely minimum variance unbiasedness: 


DEFINITION. $T$ is a minimum variance unbiased estimator, or MVUE, of $\theta$ iff


(i) $E(T - \theta) = 0$ for all $\theta$, and
(ii) $V(T) \le V(T^*)$ for all $T^*$ such that $E(T^* - \theta) = 0$.


No other member of the class of unbiased estimators of $\theta$ has a variance that is smaller than the variance of $T$.

The MVUE criterion may be operational even when minimizing MSE 
is not. At least, it may be operational if we restrict the class of estimators. 


Estimation of Population Mean 


Suppose that we are random sampling on the variable $Y$, where $E(Y) = \mu$ and $V(Y) = \sigma^2$ are unknown. To estimate the population mean $\mu$, the analogy principle suggests that we use the sample mean $\bar{Y}$. Now

$$\bar{Y} = (1/n)\sum_i Y_i, \qquad E(\bar{Y}) = E(Y) = \mu.$$

So $\bar{Y}$ is a linear unbiased estimator of $\mu$. But there are many other linear unbiased estimators of $\mu$. Let $T = \sum_i c_i Y_i$, where the $c$'s are constants. Then

$$E(T) = E\Big(\sum_i c_i Y_i\Big) = \sum_i c_i E(Y_i) = \mu \sum_i c_i,$$
$$V(T) = \sum_i c_i^2 V(Y_i) = \sigma^2 \sum_i c_i^2.$$

So any linear function of the $Y_i$'s with intercept equal to 0 and sum of slopes equal to 1 will be unbiased for $\mu$. To find the best of these, choose the $c_i$'s to minimize $\sum_i c_i^2$ subject to $\sum_i c_i = 1$. Let

$$Q = \sum_i c_i^2 - \lambda\Big(\sum_i c_i - 1\Big),$$

where $\lambda$ is a Lagrangean multiplier. Then

$$\partial Q/\partial c_i = 2c_i - \lambda \quad (i = 1, \ldots, n),$$
$$\partial Q/\partial \lambda = -\Big(\sum_i c_i - 1\Big).$$

Setting the derivatives at zero gives $c_i = \lambda/2$ for all $i$, and $\sum_i c_i = 1$. So $\sum_i c_i = n\lambda/2$, whence $\lambda = 2/n$, whence $c_i = 1/n$ for all $i$. It can be confirmed that the first-order conditions locate a minimum, so the optimal choice is $T = \sum_i (1/n)Y_i = \bar{Y}$.
We have established the following: 


THEOREM. In random sampling, sample size n, from any population, the sample mean is the minimum variance linear unbiased estimator, or MVLUE, of the population mean.


This result is strong in that it applies to any population, but weak in 
that only linear functions of the observations are considered. 
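
The theorem can be illustrated by comparing the factor $\sum_i c_i^2$, which multiplies $\sigma^2$ in $V(T)$, across weight vectors that sum to 1. This is a small sketch under that setup; the function name `variance_factor` and the unequal weights are hypothetical.

```python
from fractions import Fraction


def variance_factor(weights):
    """Return sum of c_i^2 for weights with sum 1, so that V(T) = sigma^2 * factor."""
    if sum(weights) != 1:
        raise ValueError("weights must sum to 1 for T to be unbiased")
    return sum(c * c for c in weights)


n = 4
equal = [Fraction(1, n)] * n                     # the sample mean
unequal = [Fraction(2, 5), Fraction(3, 10), Fraction(1, 5), Fraction(1, 10)]
print(variance_factor(equal), variance_factor(unequal))
```

Any departure from equal weights raises the factor above $1/n$, which is what the Lagrangean calculation above says must happen.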

Let us evaluate some other analog estimators with respect to the 
criteria introduced here. 

Population Raw Moment. For $\theta = \mu'_r = E(Y^r)$, the analogy principle suggests $M'_r = (1/n)\sum_i Y_i^r$ as the estimator. Since this is the sample mean in random sampling on the variable $W = Y^r$, we conclude that $M'_r$ is the MVLUE of $\mu'_r$, where "linear" now means linear in the $Y^r$'s. Similarly, in the bivariate case, $M'_{rs}$ is the MVLUE of $\mu'_{rs}$, where "linear" now means linear in the $X^rY^s$'s.

Population Variance. For $\theta = \mu_2 = E(Y - \mu_Y)^2 = \sigma^2$ (with $\mu_Y$ unknown), the analogy principle suggests $M_2 = S^2$, the sample variance, as the estimator. As we have seen (Section 8.4), $S^2$ cannot be interpreted as a sample mean in random sampling. And indeed, this analog estimator is biased: $E(S^2) = \sigma^2(1 - 1/n)$. But the bias is easily removed. Define the adjusted sample variance

$$S^{*2} = \sum_i (Y_i - \bar{Y})^2/(n - 1) = nS^2/(n - 1).$$

Then $E(S^{*2}) = \sigma^2$, so $S^{*2}$ is unbiased. But we have no general result here on minimum variance unbiasedness.
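
The bias of $S^2$ and its removal can be verified exactly for a small discrete population by enumerating every possible sample. The sketch below does so for a fair six-sided die with $n = 3$; that population and the function name `expected_sample_variances` are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def expected_sample_variances(pmf, n):
    """Exact E(S^2) and E(S*^2) in random sampling, size n, from a discrete pmf.

    pmf maps each integer population value to its probability.
    """
    e_s2 = Fraction(0)
    for sample in product(pmf, repeat=n):      # every ordered sample
        prob = Fraction(1)
        for y in sample:
            prob *= pmf[y]                     # independence across draws
        ybar = Fraction(sum(sample), n)
        s2 = sum((y - ybar) ** 2 for y in sample) / n   # divisor n
        e_s2 += prob * s2
    return e_s2, e_s2 * n / (n - 1)


die = {k: Fraction(1, 6) for k in range(1, 7)}   # population variance 35/12
print(expected_sample_variances(die, 3))
```

The first value returned equals $\sigma^2(1 - 1/n)$ and the second equals $\sigma^2$, in line with the formulas above.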

Population Covariance. For estimating the population covariance, the adjusted sample covariance $S^*_{XY} = nS_{XY}/(n - 1)$ is unbiased, but not necessarily minimum variance unbiased.

Population Maximum. For $\theta = \max(Y)$, the analogy principle suggests $T = \max(Y_1, \ldots, Y_n)$ as the estimator. But $T \le \theta$ and $\Pr(T = \theta) < 1$, so $E(T) < \theta$. This analog estimator is biased, and there is no obvious way to remove the bias.

Population Linear Projection Slope. For $\beta = \sigma_{XY}/\sigma_X^2$ (with $\mu_X$ and $\mu_Y$ unknown), the analogy principle suggests $B = S_{XY}/S_X^2$ as the estimator. This is a nonlinear function of sample moments. We can adjust the numerator and denominator to remove their biases, but this will not remove bias in the ratio, because the expectation of a ratio is not generally equal to the ratio of expectations. Indeed, this analog estimator will in general be biased, and there is no obvious way to remove the bias. (An important exception occurs when the population CEF is linear: see Section 13.2.)


11.4. Asymptotic Criteria 


As we have seen, evaluating analog estimators on the basis of their exact 
expectation and variance may run into an impasse, for the exact distri- 
bution may well depend on specifics of the population. Progress is 
available if we rely on asymptotic, that is, approximate, sampling distri- 
butions. We put an n subscript on T, as in Chapter 9, to emphasize the 
dependence upon sample size, and introduce two classical criteria. 


DEFINITION. $T_n$ is a consistent estimator of $\theta$ iff $T_n \xrightarrow{P} \theta$. Equivalently, $T_n$ is a consistent estimator of $\theta$ iff $\lim \Pr(|T_n - \theta| \ge \epsilon) = 0$ for all $\epsilon > 0$.


Consistency is attractive because it says that as the sample size increases 
indefinitely, the distribution of the estimator becomes entirely concen- 
trated at the parameter value. 

The sample mean is a consistent estimator of the population mean in random sampling. The Law of Large Numbers says precisely that, taking $\theta = \mu$, $T = \bar{Y}$. By the same law, any sample raw moment is a consistent estimator of the corresponding population raw moment in random sampling. Further, by the Slutsky Theorems S1 and S2, any continuous function of the sample moments is, in random sampling, a consistent estimator of the corresponding function of the population moments. For example, $S^2 = M_2 = M^*_2 - (\bar{Y} - \mu)^2 = h(M^*_2, \bar{Y})$ is a consistent estimator of $\sigma^2 = \mu_2 = h(\mu_2, \mu)$, and $B = S_{XY}/S_X^2$ is a consistent estimator of $\beta = \sigma_{XY}/\sigma_X^2$.

There is a general presumption that analog estimators are consistent in random sampling. The intuition runs as follows. The analog estimator is a function of the frequency distribution in the sample. The parameter is the same function of the probability distribution. The sample frequencies will, as n gets large, converge to the population frequencies, that is, to the probability distribution.
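
The definition of consistency can be made concrete for the sample mean in Bernoulli sampling, where $\Pr(|\bar{Y} - \theta| \ge \epsilon)$ is an exact binomial sum. The sketch below evaluates that probability for growing $n$; the function name `tail_probability` and the values $\theta = 3/10$, $\epsilon = 1/10$ are hypothetical. Rational inputs are assumed so that the boundary comparison is exact.

```python
from fractions import Fraction
from math import comb


def tail_probability(theta, n, eps):
    """Exact Pr(|Ybar - theta| >= eps) for the mean of n Bernoulli(theta) draws."""
    theta, eps = Fraction(theta), Fraction(eps)
    a, b = theta.numerator, theta.denominator
    total = 0   # numerator of the probability over the common denominator b**n
    for k in range(n + 1):                       # k = number of ones in the sample
        if abs(Fraction(k, n) - theta) >= eps:
            total += comb(n, k) * a ** k * (b - a) ** (n - k)
    return float(Fraction(total, b ** n))


for n in (10, 100, 1000):
    print(n, tail_probability(Fraction(3, 10), n, Fraction(1, 10)))
```

The printed probabilities shrink toward zero as $n$ grows, which is what convergence in probability of $\bar{Y}$ to $\theta$ requires.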

Typically there are many consistent estimators of the same parameter. For example, in sampling from a symmetric population, both the sample mean and the sample median are consistent estimators of $\mu = E(Y)$. To discriminate among consistent estimators of the same parameter, we turn from degenerate limiting distributions to asymptotic distributions. Typically the consistent estimators will have asymptotic normal distributions, centered at the parameter value.


DEFINITION. $T_n$ is a best asymptotically normal, or BAN, estimator of $\theta$ iff


(i) $T_n \overset{A}{\sim} N(\theta, \phi^2/n)$, and
(ii) $\phi^2 \le \phi^{*2}$ for all $T^*_n$ such that $T^*_n \overset{A}{\sim} N(\theta, \phi^{*2}/n)$.


No other member of the class of consistent, asymptotically normal estimators of $\theta$ has an asymptotic variance that is smaller than the asymptotic variance of $T_n$.

In effect, the BAN criterion is the asymptotic version of the MVUE 
criterion. Throughout econometrics, except for those situations in which 
exact results are available, best asymptotic normality, sometimes labeled 
asymptotic efficiency, is the customary criterion of choice. 


11.5. Confidence Intervals 


In reporting an estimate of a parameter, it is a good idea to accompany 
it with some information about the reliability of the estimator, or rather 
about its unreliability—the extent to which it varies from sample to 
sample. The natural measure is the standard deviation of the estimator. 
Often that information is presented in the form of a confidence interval. 

Suppose that $Y \sim N(\mu, \sigma^2)$. Then in random sampling, sample size $n$, the sample mean $\bar{Y}$ is distributed $N(\mu, \sigma^2/n)$, so $Z = \sqrt{n}(\bar{Y} - \mu)/\sigma \sim N(0, 1)$. Let

$$A = \{|Z| \le 1.96\},$$

so $\Pr(A) = F(1.96) - F(-1.96) = 0.975 - 0.025 = 0.95$, where $F(\cdot)$ is the $N(0, 1)$ cdf. Now

$$A = \{|\bar{Y} - \mu| \le 1.96\sigma/\sqrt{n}\}$$
$$\;\; = \{\mu - 1.96\sigma/\sqrt{n} \le \bar{Y} \le \mu + 1.96\sigma/\sqrt{n}\}$$
$$\;\; = \{\bar{Y} - 1.96\sigma/\sqrt{n} \le \mu \le \bar{Y} + 1.96\sigma/\sqrt{n}\}.$$

So the statement, "The parameter $\mu$ lies in the interval $\bar{Y} \pm 1.96\sigma/\sqrt{n}$," is true with probability 95%. We say that $\bar{Y} \pm 1.96\sigma/\sqrt{n}$ is a 95% confidence interval for the parameter $\mu$.

To get a confidence interval for $\mu$ at a different confidence level, just use a different cutoff point: for example, $\bar{Y} \pm 1.645\sigma/\sqrt{n}$ is a 90% confidence interval for $\mu$. For a given confidence level, a narrow confidence interval is desirable: it indicates that the sample has said a lot about the parameter value. What produces a narrow interval, obviously, is a small standard deviation, $\sigma/\sqrt{n}$, that is, a small $\sigma^2$ and/or a large $n$.

Now suppose that the random variable $Y$ is distributed (not necessarily normally) with $E(Y) = \mu$ and $V(Y) = \sigma^2$. Then in random sampling, sample size $n$, the sample mean $\bar{Y}$ is asymptotically distributed $N(\mu, \sigma^2/n)$. So, by the logic above, $\Pr(A) \approx 0.95$, and we say that $\bar{Y} \pm 1.96\sigma/\sqrt{n}$ is an approximate 95% confidence interval for the parameter $\mu$.

In practice these results will not be operational, because $\sigma^2$ is unknown. Consider the sample t-ratio $U = \sqrt{n}(\bar{Y} - \mu)/S$, where $S^2$ is the sample variance. Recall from Section 9.6 that $U \xrightarrow{D} N(0, 1)$. So by the logic above, $\bar{Y} \pm 1.96S/\sqrt{n}$ is an approximate 95% confidence interval for the parameter $\mu$. The statistic $S/\sqrt{n}$ is called the standard error of $\bar{Y}$, as distinguished from its standard deviation, $\sigma/\sqrt{n}$.

The logic extends to construct approximate confidence intervals for parameters other than the population mean. Suppose that $T_n$ is a sample statistic used to estimate a parameter $\theta$, and that $T_n \overset{A}{\sim} N(\theta, \phi^2/n)$. If $\phi$ is known, then $T_n \pm 1.96\phi/\sqrt{n}$ provides an approximate 95% confidence interval for the parameter $\theta$. More practically, when we have $\hat{\phi}^2$, a consistent estimator of $\phi^2$, then $T_n \pm 1.96\hat{\phi}/\sqrt{n}$ provides an approximate 95% confidence interval for the parameter $\theta$.


Example. The sample variance $S^2$ is used to estimate the population variance $\sigma^2$. Recall from Section 9.6 that $S^2 \overset{A}{\sim} N(\sigma^2, \phi^2/n)$, with $\phi^2 = \mu_4 - \mu_2^2$. So let $\hat{\phi}^2 = M_4 - M_2^2$, and report $S^2 \pm 1.96\hat{\phi}/\sqrt{n}$ as the approximate 95% confidence interval for $\sigma^2$.


The reasoning is the same as that used above to get $\bar{Y} \pm 1.96S/\sqrt{n}$ as an approximate 95% confidence interval for $\mu$. The statistic $\hat{\phi}/\sqrt{n}$ is called the (asymptotic) standard error of $T_n$.
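
The two approximate intervals just described, $\bar{Y} \pm 1.96S/\sqrt{n}$ for $\mu$ and $S^2 \pm 1.96\hat{\phi}/\sqrt{n}$ for $\sigma^2$, can be computed as follows. This is a minimal sketch; the function names and the nine data values are hypothetical, and with so few observations the asymptotic approximation should not be taken seriously.

```python
from math import sqrt


def approx_ci_mean(ys, z=1.96):
    """Approximate confidence interval Ybar +/- z * S / sqrt(n)."""
    n = len(ys)
    ybar = sum(ys) / n
    s2 = sum((y - ybar) ** 2 for y in ys) / n      # sample variance, divisor n
    se = sqrt(s2 / n)                              # standard error of Ybar
    return ybar - z * se, ybar + z * se


def approx_ci_variance(ys, z=1.96):
    """Approximate confidence interval S^2 +/- z * phi_hat / sqrt(n)."""
    n = len(ys)
    ybar = sum(ys) / n
    m2 = sum((y - ybar) ** 2 for y in ys) / n      # M_2 = S^2
    m4 = sum((y - ybar) ** 4 for y in ys) / n      # M_4
    se = sqrt((m4 - m2 ** 2) / n)                  # asymptotic standard error of S^2
    return m2 - z * se, m2 + z * se


# Hypothetical data, for illustration only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(approx_ci_mean(data))
print(approx_ci_variance(data))
```

Passing `z=1.645` gives the corresponding approximate 90% intervals.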


A few remarks on confidence intervals: 

* Perhaps the only proper exception to using the 1.96 rule for 95% confidence arises when $Y$ is normally distributed. For then we know (Section 8.6) that the exact distribution of $\sqrt{n - 1}\,(\bar{Y} - \mu)/S$ is $t(n - 1)$, so the Student's t-table can be consulted for the critical value to replace 1.96. There may be other exceptional cases in which the exact distribution of the sample t-ratio is known. But, common practice notwithstanding, there is no good reason to rely routinely on a t-table rather than a normal table unless $Y$ itself is normally distributed.

* It is good practice to report the standard error of a parameter 
estimate along with the estimate itself. Conventionally, the standard 
error is put in parentheses underneath the estimate. Readers can then 
construct (approximate) confidence intervals as they see fit. 

* It is common practice, but not good practice, to report the “t-statistic” 
(ratio of an estimate to its standard error) instead of (or even in addition 
to) the standard error. More on this in Section 21.3. 


Exercises 


11.1 In random sampling, sample size $n$, from a univariate population, let $T = c\bar{Y}$, where $\bar{Y}$ is the sample mean.


(a) Choose $c$ to minimize the MSE of $T$ as an estimator of $\mu = E(Y)$.
(b) Comment on the practical usefulness of your result.


11.2 In size-n random sampling from a bivariate population $f(x, y)$, suppose that the objective is to estimate the parameter $\theta = \mu_Y - \mu_X$. For example, the population may consist of married couples, with $Y$ = husband's earnings and $X$ = wife's earnings. The sample statistics $\bar{X}$, $\bar{Y}$, $S_X^2$, $S_Y^2$, and $S_{XY}$ are available.


(a) Propose a statistic $T$ that is an unbiased estimator of $\theta$. Show that it is unbiased.

(b) Find its variance $V(T)$ in terms of the population variances and covariance of $X$ and $Y$.

(c) For the practical case, in which those population variances and covariance are unknown, propose an unbiased estimator of $V(T)$. Show that it is unbiased.

(d) What statistic would you report in practice as a standard error for $T$?


11.3 In one population, $E(Y_1) = \mu$ and $V(Y_1) = \sigma_1^2$; in a second population, $E(Y_2) = \mu$ and $V(Y_2) = \sigma_2^2$. The population variances are known, but their common expectation $\mu$ is unknown. Random samples of sizes $n_1$ and $n_2$ respectively are drawn from the two populations. The two samples are independent. It is proposed to combine the sample means $\bar{Y}_1$ and $\bar{Y}_2$ linearly into a single estimator of the common mean $\mu$.


(a) Consider all possible linear combinations of $\bar{Y}_1$ and $\bar{Y}_2$. Determine the one that is minimum variance unbiased as an estimator of $\mu$.

(b) Verify that the variance of that estimator is less than the variance of each of the two sample means.


11.4 You are interested in estimating $\theta = \mu_1 - \mu_2$, where $Y_1 \sim N(\mu_1, 50)$ and $Y_2 \sim N(\mu_2, 100)$. You can afford a total of 100 observations. Determine how many you should draw on $Y_1$ and how many on $Y_2$.


11.5 Suppose that $Y_1 = X + U_1$ and $Y_2 = X + U_2$, where $X$ = permanent income, $Y_1$ = current income in year 1, and $Y_2$ = current income in year 2. It is known that $U_1$ and $U_2$ have zero expectations and are uncorrelated with $X$. It is also known that $V(X) = 400$, $V(U_1) = 200$, $V(U_2) = 100$, and $C(U_1, U_2) = 60$. A random sample of size 10 is drawn from the joint probability distribution of $Y_1$ and $Y_2$. The objective is to estimate $E(X)$, which is unknown. The sample means are $\bar{Y}_1$ and $\bar{Y}_2$. Consider all linear combinations of the sample means that are unbiased estimators of $E(X)$, and find the one that has minimum variance.


11.6 A random sample from a population has $n = 30$, $\sum_i x_i = 120$, $\sum_i x_i^2 = 8310$.


(a) Calculate unbiased estimates of the population mean, the popu- 
lation variance, and the variance of the sample mean. 

(b) Provide an approximate 95% confidence interval for the popu- 
lation mean. 


11.7 A random sample from a Bernoulli population has 35 observations with $Y = 1$, and 65 observations with $Y = 0$.

(a) Calculate unbiased estimates of the population mean, the population variance, and the variance of the sample mean.

(b) Provide an approximate 95% confidence interval for the population mean.


11.8 A random sample from an exponential population with unknown parameter $\lambda$ has $n = 50$, $\sum_i x_i = 30$.


(a) Calculate an unbiased estimate of the population mean. Is that estimate consistent? Explain briefly.

(b) Calculate a consistent estimate of the parameter $\lambda$. Is that estimate unbiased? Explain briefly.

(c) Provide an approximate 95% confidence interval for $\lambda$.


11.9 These statistics were calculated in a random sample of size 100 from the joint distribution of $X$ and $Y$:


$$\bar{X} = 2, \quad S_X^2 = 5, \quad \bar{Y} = 1, \quad S_Y^2 = 4, \quad S_{XY} = 3.$$

Construct an approximate 95% confidence interval for the parameter $\theta = E(X)/E(Y)$.


11.10 We are interested in estimating the proportion of the population whose incomes are below the poverty line, a prespecified level of income. Let $Y$ = income and $c$ = poverty line, so the parameter of interest is $\theta = \Pr(Y \le c) = G(c)$, where $G(\cdot)$ is the unknown cdf of income. For random sampling, sample size $n$, from the population, the analogy principle suggests that we estimate $\theta$ by $T$ = proportion of the sample observations having $Y \le c$.


(a) Find $E(T)$ and $V(T)$. Is $T$ unbiased? Is $T$ consistent? Explain.
(b) Show that $T \overset{A}{\sim} N[\theta, \theta(1 - \theta)/n]$.


11.11 For the setup in Exercise 11.10, suppose now that it is known that $Y$ is normally distributed, with known variance but unknown mean. So $\theta = G(c) = F[(c - \mu)/\sigma]$, where $F(\cdot)$ is the standard normal cdf and $\sigma$ is known. Because $\theta$ is a function of the population moment $\mu$, the analogy principle suggests an alternative estimator of $\theta$, namely $U = F[(c - \bar{Y})/\sigma]$, where $\bar{Y}$ is the sample mean in random sampling, sample size $n$.


(a) Show that $U$ is consistent. Is it unbiased? Explain.
(b) Find the asymptotic distribution of $U$.
(c) On the basis of their asymptotic distributions, which estimator of $\theta$ would you prefer to use, $T$ or $U$? Hint: Two useful facts about the standard normal pdf and cdf, $f(z)$ and $F(z)$, are

$$dF(z)/dz = f(z),$$
$$[f(z)]^2/\{F(z)[1 - F(z)]\} < 0.64 \text{ for all values of } z.$$


12 Advanced Estimation Theory 


12.1. The Score Variable 


Our discussion of parameter estimation has been general with respect to the population: we have not assumed knowledge of the family to which the population belongs. Now we turn to estimation in more completely specified situations, where the family, that is, the form, of the pdf or pmf is known up to a parameter of interest. The value of that parameter is then a missing link needed to complete the specification of the population.

Suppose that the random variable $Y$ has pmf or pdf $f(y; \theta)$, where the function is known except for the parameter value $\theta$. Define the log-likelihood variable

$$L = \log f(Y; \theta) = L(Y; \theta),$$

and the score variable

$$Z = \partial \log f(Y; \theta)/\partial\theta = \partial L/\partial\theta = z(Y; \theta).$$


We write Y rather than y as the argument to emphasize that both L and 
Z, being functions of the random variable Y, are themselves random 
variables. The score variable plays several roles in the theory. First we 
establish: 


ZERO EXPECTED SCORE (or ZES) RULE. The expected value 
of the score variable is zero. 


Proof. For convenience treat the continuous case. Since $Z$ is a function of $Y$, its expectation is

$$E(Z) = \int z(y; \theta) f(y; \theta)\,dy = \int zf\,dy.$$

(Note: In this chapter, $\int$ is shorthand for $\int_{-\infty}^{\infty}$, and the arguments of functions may be omitted for convenience.) Because $f(y; \theta)$ is a pdf, it must be true that

$$\int f(y; \theta)\,dy = 1 \text{ for all } \theta.$$

Differentiating both sides with respect to $\theta$ gives

$$\int (\partial f/\partial\theta)\,dy = 0.$$

(Note: Here and subsequently it is assumed that the range of integration does not depend on $\theta$.) But

$$\partial f/\partial\theta = (\partial \log f/\partial\theta) f = z(y; \theta) f = zf,$$

so $\int zf\,dy = 0$. ∎


12.2. Cramér-Rao Inequality 


One role of the score variable is to set a standard for unbiased estimation of $\theta$ in random sampling. We show:


CRAMÉR-RAO INEQUALITY, or CRI. In random sampling, sample size $n$, from an $f(y; \theta)$ population, if $T = h(Y_1, \ldots, Y_n)$ and $E(T) = \theta$ for all $\theta$, then $V(T) \ge 1/[nV(Z)]$.


Proof. First, consider the case $n = 1$. Here $T = h(Y)$ with $E(T) = \theta$ for all $\theta$. That is,

$$\int h(y) f(y; \theta)\,dy = \theta \text{ for all } \theta.$$

Differentiating both sides with respect to $\theta$ gives

$$\int h(y)\,zf\,dy = 1,$$

which says that $E(TZ) = 1$. Because $E(Z) = 0$, it follows that $C(T, Z) = 1$. Recall the Cauchy-Schwarz Inequality (Section 6.6), which says that squared correlations cannot exceed unity: it must be the case that $V(T)V(Z) \ge C^2(T, Z)$. So here $V(T)V(Z) \ge 1$, or $V(T) \ge 1/V(Z)$, which is the CRI for $n = 1$.

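Proceed to the case $n > 1$. Let $g = g_n(y_1, \ldots, y_n)$ denote the joint pdf of the random sample. Here $T = h(Y_1, \ldots, Y_n)$ with $E(T) = \theta$ for all $\theta$. That is,

$$\int \cdots \int h\,g\,dy_1 \cdots dy_n = \theta \text{ for all } \theta.$$

Differentiating both sides with respect to $\theta$ gives

$$\int \cdots \int h\,(\partial g/\partial\theta)\,dy_1 \cdots dy_n = 1.$$

But

$$\partial g/\partial\theta = (\partial \log g/\partial\theta)\,g, \qquad g = \prod_i f(y_i; \theta),$$

so $\log g = \sum_i \log f(y_i; \theta) = \sum_i \log f_i$, say. Thus

$$\partial \log g/\partial\theta = \sum_i (\partial \log f_i/\partial\theta) = \sum_i z(y_i; \theta) = \sum_i z_i = n\bar{z},$$

say, where $\bar{z}$ is the sample mean of the score variable. So

$$\int \cdots \int h\,n\bar{z}\,g\,dy_1 \cdots dy_n = 1,$$

which says that $nE(T\bar{Z}) = 1$. Because $E(\bar{Z}) = E(Z) = 0$, it follows that $nC(T, \bar{Z}) = 1$, so $C(T, \bar{Z}) = 1/n$. By the Cauchy-Schwarz Inequality, $V(T)V(\bar{Z}) \ge 1/n^2$, so $V(T) \ge 1/[n^2V(\bar{Z})]$. But $V(\bar{Z}) = V(Z)/n$, so $V(T) \ge 1/[nV(Z)]$. ∎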


The CRI does not provide us with an estimator, but rather sets a standard against which unbiased estimators can be assessed. If we happen to know, or have located, an unbiased estimator $T$ with $V(T) = 1/[nV(Z)]$, then we can stop searching for a better (that is, lower-variance) unbiased estimator, because the CRI tells us that there is no $T^*$ such that $E(T^*) = \theta$ and $V(T^*) < V(T)$.


Example. Suppose that $Y \sim$ Bernoulli($\theta$), so its pmf is

$$f(y; \theta) = \theta^y(1 - \theta)^{1-y} \text{ for } y = 0, 1.$$

As a random variable

$$f(Y; \theta) = \theta^Y(1 - \theta)^{1-Y}.$$

So

$$L = \log f = Y \log\theta + (1 - Y)\log(1 - \theta),$$
$$Z = \partial L/\partial\theta = Y/\theta - (1 - Y)/(1 - \theta) = (Y - \theta)/[\theta(1 - \theta)].$$

We know (see Table 3.1) that $E(Y) = \theta$ and $V(Y) = \theta(1 - \theta)$. So

$$E(Z) = E(Y - \theta)/[\theta(1 - \theta)] = 0,$$
$$V(Z) = V(Y)/[\theta(1 - \theta)]^2 = \theta(1 - \theta)/[\theta(1 - \theta)]^2 = 1/[\theta(1 - \theta)].$$

The expectation illustrates the ZES rule. The variance formula implies by the CRI that if $T$ is an unbiased estimator of $\theta$, then

$$V(T) \ge 1/[nV(Z)] = \theta(1 - \theta)/n.$$

The sample mean $\bar{Y}$ has $E(\bar{Y}) = \theta$ and $V(\bar{Y}) = V(Y)/n = \theta(1 - \theta)/n = 1/[nV(Z)]$. It follows that $\bar{Y}$ is the MVUE of $\theta$ in random sampling from a Bernoulli population. This conclusion is considerably stronger than the previous general result that $\bar{Y}$ is MVLUE (see Section 11.3), for now the class of estimators considered is no longer confined to linear functions of the observations.
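
The Bernoulli calculations in this example can be checked with exact rational arithmetic. The sketch below computes $E(Z)$ and $V(Z)$ directly from the pmf and compares the Cramér-Rao bound with $V(\bar{Y})$; the function names and the values $\theta = 1/4$, $n = 10$ are hypothetical.

```python
from fractions import Fraction


def bernoulli_score_moments(theta):
    """Exact E(Z) and V(Z) for the score Z = (Y - theta) / [theta (1 - theta)]."""
    pmf = {0: 1 - theta, 1: theta}

    def score(y):
        return (y - theta) / (theta * (1 - theta))

    mean_z = sum(p * score(y) for y, p in pmf.items())
    var_z = sum(p * (score(y) - mean_z) ** 2 for y, p in pmf.items())
    return mean_z, var_z


def cramer_rao_bound(theta, n):
    """Lower bound 1 / [n V(Z)] on the variance of an unbiased estimator."""
    _, var_z = bernoulli_score_moments(theta)
    return 1 / (n * var_z)


theta, n = Fraction(1, 4), 10
print(bernoulli_score_moments(theta))
print(cramer_rao_bound(theta, n), theta * (1 - theta) / n)   # bound vs. V(Ybar)
```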


For the normal distribution, as well as for the Bernoulli, the sample 
mean is the MVUE of the population mean. 

There is another way to state the CRI. Recall that $Z = \partial \log f/\partial\theta$, with $E(Z) = \int z(y; \theta) f(y; \theta)\,dy = 0$. Define the information variable

$$W = -\partial Z/\partial\theta = -\partial^2 \log f/\partial\theta^2.$$


This too is a random variable. We show: 


INFORMATION RULE. The expectation of the information variable is equal to the variance of the score variable.


Proof. Differentiate $E(Z) = \int zf\,dy = 0$ with respect to $\theta$:

$$\int [\partial(zf)/\partial\theta]\,dy = 0.$$

Now

$$\partial(zf)/\partial\theta = z(\partial f/\partial\theta) + (\partial z/\partial\theta)f = z(\partial \log f/\partial\theta)f - wf = z^2f - wf = (z^2 - w)f.$$

So $\int (z^2 - w)f\,dy = 0$. That is, $E(Z^2 - W) = 0$, so $E(Z^2) = E(W)$. But $E(Z) = 0$, so $V(Z) = E(Z^2) = E(W)$. ∎


Example. For the Bernoulli distribution we have seen that the score variable, $Z = (Y - \theta)/[\theta(1 - \theta)]$, has $V(Z) = 1/[\theta(1 - \theta)]$. The information variable is $W = -\partial Z/\partial\theta = (1 + Z - 2\theta Z)/[\theta(1 - \theta)]$. With $E(Z) = 0$, we see that $E(W) = 1/[\theta(1 - \theta)]$.


With $E(W) = V(Z)$, we can restate the CRI conclusion as $V(T) \ge 1/[nE(W)]$. This restatement is useful because for some distributions, $E(W)$ is easier to calculate than $V(Z)$. It also accounts for the label "information variable": the larger the expected information variable is, the more precise the unbiased estimation of a parameter may be.
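
As a check on the information rule in the Bernoulli case, the following sketch evaluates $E(W)$ and $E(Z^2)$ exactly from the two-point pmf, using the expression for $W$ given in the example above. The function name `bernoulli_information` and the value $\theta = 1/4$ are hypothetical.

```python
from fractions import Fraction


def bernoulli_information(theta):
    """Return exact (E(W), E(Z^2)) for a Bernoulli(theta) population."""
    expected_w = Fraction(0)
    expected_z2 = Fraction(0)
    for y, prob in ((0, 1 - theta), (1, theta)):
        z = (y - theta) / (theta * (1 - theta))               # score at y
        w = (1 + z - 2 * theta * z) / (theta * (1 - theta))   # information at y
        expected_w += prob * w
        expected_z2 += prob * z * z
    return expected_w, expected_z2


e_w, e_z2 = bernoulli_information(Fraction(1, 4))
print(e_w, e_z2)   # the two should agree, since E(Z) = 0
```

Both quantities equal $1/[\theta(1 - \theta)]$, so either one can be used in the bound $1/[nE(W)] = 1/[nV(Z)]$.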


12.3. ZES-Rule Estimation 


A second role of the score variable is to provide an estimator of $\theta$. Recall
an analogy that suggested the sample mean as an estimator of the
population mean (Section 11.2). An instructive way to restate that
analogy is as follows. Because $E(Y - \mu) = 0$, we can characterize $\mu$ as
the value for $c$ that makes $E(Y - c) = 0$. Now the sample analog of the
population average $E(Y - c)$ is the sample average $(1/n)\sum_i (Y_i - c)$, so let
us estimate $\mu$ by the value for $c$ that makes $(1/n)\sum_i (Y_i - c) = 0$. The
result is $c = (1/n)\sum_i Y_i = \bar{Y}$.

With that in mind, we will use the ZES rule to obtain an estimator of
$\theta$. Suppose that we are drawing a random sample from a population in
which the pdf or pmf $f(Y; \theta)$ is known except for the value of the
parameter $\theta$. The score variable is $Z = z(Y; \theta)$. Because $E(Z) = 0$, we
can characterize $\theta$ as the value for $c$ that makes $E[z(Y; c)] = 0$. Now, the
sample analog of $E[z(Y; c)]$ is $(1/n)\sum_i z(Y_i; c)$, so let us estimate $\theta$ by the
value for $c$ that makes $(1/n)\sum_i z(Y_i; c) = 0$, or equivalently that makes
$\sum_i z(Y_i; c) = 0$. Changing notation somewhat, let $T$ be the solution value,
and let $\hat{Z}_i = z(Y_i; T)$. Then by construction, the ZES-rule estimator $T$
satisfies $\sum_i \hat{Z}_i = 0$.

Example. Suppose that $Y \sim$ exponential($\lambda$), so $f(Y; \lambda) =
\lambda \exp(-\lambda Y)$. Then $L = \log \lambda - \lambda Y$, and $Z = (1/\lambda) - Y$. Let $\hat{Z}_i = (1/T) -
Y_i$. Then $\sum_i \hat{Z}_i = (n/T) - \sum_i Y_i = n[(1/T) - \bar{Y}]$. Setting this at zero gives
$T = 1/\bar{Y}$ as the ZES-rule estimator of $\lambda$.
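
When the zero-score equation has no convenient closed form, it can be solved numerically. As a hedged illustration (not from the original text), the sketch below solves $\sum_i z(Y_i; c) = 0$ for the exponential case by bisection and compares the root with $1/\bar{Y}$. The bisection routine, the simulated sample, the seed, and the value $\lambda = 2$ are all assumptions made for the example.

```python
import random


def mean_score(c, ys):
    """Sample average of the exponential score z(y; c) = 1/c - y."""
    return sum(1.0 / c - y for y in ys) / len(ys)


def zes_estimate(ys, lo=1e-8, hi=1e8, iters=200):
    """Solve mean_score(c) = 0 by bisection.

    The average score is strictly decreasing in c, so a positive value
    means the root lies to the right of the midpoint.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_score(mid, ys) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


random.seed(0)           # arbitrary seed, for reproducibility only
lam = 2.0                # hypothetical true parameter
ys = [random.expovariate(lam) for _ in range(500)]

t_numeric = zes_estimate(ys)
t_closed = 1.0 / (sum(ys) / len(ys))
print("numerical root:", t_numeric)
print("1 / sample mean:", t_closed)
```

The two printed values should coincide up to numerical tolerance, since the bisection root and $1/\bar{Y}$ solve the same equation.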


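We sketch a derivation of the asymptotic distribution of ZES-rule
estimators. Using a linear approximation at the point $\theta$, and recalling
the definition of the information variable $W = -\partial Z/\partial\theta$, write

$$\hat{Z}_i = z(Y_i; T) \approx z(Y_i; \theta) + [\partial z(Y_i; \theta)/\partial\theta](T - \theta) = Z_i - W_i(T - \theta),$$

where $W_i = -\partial Z_i/\partial\theta$. So

$$\sum_i \hat{Z}_i \approx \sum_i Z_i - (T - \theta) \sum_i W_i.$$

By construction, $\sum_i \hat{Z}_i = 0$, whence

$$T - \theta \approx \sum_i Z_i \Big/ \sum_i W_i = \bar{Z}/\bar{W},$$

where $\bar{Z} = (1/n)\sum_i Z_i$ and $\bar{W} = (1/n)\sum_i W_i$. We may neglect the
approximation error. Then, because $\bar{Z}$ and $\bar{W}$ are sample means in random
sampling on the variables $Z$ and $W$, the problem amounts to finding the
asymptotic distribution of a ratio of sample means, as in Section 10.4.
By the LLN, $\bar{Z} \xrightarrow{P} E(Z) = 0$ and $\bar{W} \xrightarrow{P} E(W) = V(Z)$, so that $(T - \theta) \xrightarrow{P} 0$.
We conclude that $T$, which may or may not be unbiased, is a consistent
estimator of $\theta$. Further, we have

$$\sqrt{n}(T - \theta) = \sqrt{n}\,\bar{Z}/\bar{W} = (1/\bar{W})\sqrt{n}\,\bar{Z}.$$

By the CLT and LLN, we know that $\sqrt{n}\,\bar{Z} \xrightarrow{D} N[0, V(Z)]$ and $\bar{W} \xrightarrow{P} E(W)$,
so by S4,

$$\sqrt{n}(T - \theta) \xrightarrow{D} N(0, \phi^2),$$

with $\phi^2 = V(Z)/E^2(W) = 1/V(Z)$. Equivalently,

$$T \overset{A}{\sim} N\{\theta, 1/[nV(Z)]\}.$$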

Observe that the asymptotic variance of $T$ is at the lower bound for
unbiased estimation of $\theta$, which is an attractive property. Indeed there
is an asymptotic version of the Cramér-Rao Inequality that says that the
asymptotic variance of a consistent estimator cannot be less than the
CRI lower bound. So the ZES-rule estimator is a BAN estimator. In this
sense, of all the analogies to draw on for an estimator of $\theta$, the ZES
rule is the best. Of course, to use it, we must have knowledge of the
form of the population pdf or pmf.


Example. Continue the preceding exponential example. Recall
(Section 8.3) that $W = k\lambda\bar{Y} \sim \chi^2(k)$ with $k = 2n$. (Caution: Do not confuse
this $W$ with the information variable.) Using the results for $E(1/W)$ and
$E(1/W^2)$ reported in Section 8.5, we find

$$E(T) = n\lambda/(n - 1), \qquad V(T) = n^2\lambda^2/[(n - 1)^2(n - 2)].$$

While $E(T) > \lambda$ and $V(T) > \lambda^2/n$, we see for large $n$ that $E(T) \approx \lambda$ and
also $V(T) \approx \lambda^2/n$, which is the CRI lower bound.
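
To see how quickly the exact moments approach their limits, one can tabulate them for several sample sizes. The sketch below is an illustration only; the function names and the choices of $\lambda$ and $n$ are assumptions. It evaluates the formulas for $E(T)$ and $V(T)$ given in the example and compares $V(T)$ with the bound $\lambda^2/n$.

```python
def exact_moments(n, lam):
    """E(T) and V(T) for T = 1/Ybar under exponential(lam) sampling (needs n > 2)."""
    mean_t = n * lam / (n - 1)
    var_t = n ** 2 * lam ** 2 / ((n - 1) ** 2 * (n - 2))
    return mean_t, var_t


def cri_bound(n, lam):
    """Lower bound lam^2 / n for the variance of an unbiased estimator of lam."""
    return lam ** 2 / n


lam = 2.0  # hypothetical parameter value
ratios = {}
for n in (5, 50, 500, 5000):
    mean_t, var_t = exact_moments(n, lam)
    ratios[n] = (mean_t / lam, var_t / cri_bound(n, lam))
    print(n, ratios[n])
```

Both ratios exceed 1 at every $n$ and shrink toward 1 as $n$ grows, which is the pattern described in the example.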


12.4. Maximum Likelihood Estimation 


There is another approach, which is better known, that produces ZES- 
rule estimation. 

Consider a population in which the random variable $Y$ has pdf or
pmf $f(y; \theta)$, with the function $f$ known but the parameter $\theta$ unknown.
Under random sampling, sample size $n$, the joint pdf for the sample is

$$g_n(y_1, \ldots, y_n; \theta) = \prod_i f(y_i; \theta).$$

We are accustomed to reading this as a function of $y_1, \ldots, y_n$ for given
$\theta$, but mathematically it can also be read as a function of $\theta$ for given
$y_1, \ldots, y_n$. When that is done we refer to it as the likelihood function
for $\theta$:

$$\mathcal{L} = \mathcal{L}(\theta; y_1, \ldots, y_n) = g_n(y_1, \ldots, y_n; \theta) = \prod_i f(y_i; \theta).$$

The maximum likelihood, or ML, estimator of $\theta$ is the value for $\theta$ that
maximizes the sample likelihood function $\mathcal{L}$. Now to maximize $\mathcal{L}$ we
may as well maximize its logarithm,

$$\log \mathcal{L} = \sum_i \log f(y_i; \theta) = \sum_i L_i,$$

where $L_i = \log f(y_i; \theta)$. Differentiating with respect to $\theta$ gives

$$\partial \log \mathcal{L}/\partial\theta = \sum_i \partial L_i/\partial\theta = \sum_i Z_i = \sum_i z(y_i; \theta).$$

Setting this at zero gives $\sum_i z(y_i; T) = 0$, which remains to be solved for
the ML estimator $T$. Observe that this first-order condition (FOC) is
precisely the equation for ZES-rule estimation.


Example. Suppose $Y \sim$ Bernoulli($\theta$), so $Z = (Y - \theta)/[\theta(1 - \theta)]$.
Then

$$\sum_i Z_i = \sum_i (y_i - \theta)/[\theta(1 - \theta)].$$

The FOC chooses $T$ to make

$$\sum_i \hat{Z}_i = \sum_i (y_i - T)/[T(1 - T)] = 0,$$

that is, to make $\sum_i (y_i - T) = 0$. So $\sum_i y_i = nT$, whence $T = \bar{y}$. It can be
confirmed that this locates a maximum. So the ML estimator of the
Bernoulli parameter $\theta$ is the sample mean.
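
The claim that the first-order condition locates a maximum can be checked numerically by maximizing the log-likelihood directly. The sketch below is a hedged illustration: the ternary search, the function names, and the ten-observation sample are assumptions, not part of the original text. It relies on the Bernoulli log-likelihood being concave in $c$ and on the sample containing both zeros and ones.

```python
import math


def log_likelihood(c, ys):
    """Bernoulli sample log-likelihood: sum of y log c + (1 - y) log(1 - c)."""
    n1 = sum(ys)
    n0 = len(ys) - n1
    return n1 * math.log(c) + n0 * math.log(1.0 - c)


def ml_bernoulli(ys, iters=200):
    """Maximize the (concave) log-likelihood over 0 < c < 1 by ternary search."""
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, ys) < log_likelihood(m2, ys):
            lo = m1  # maximum lies to the right of m1
        else:
            hi = m2  # maximum lies to the left of m2
    return 0.5 * (lo + hi)


ys = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]  # made-up binary sample
t_ml = ml_bernoulli(ys)
y_bar = sum(ys) / len(ys)
print("numerical maximizer:", t_ml)
print("sample mean:", y_bar)
```

The numerical maximizer agrees with the sample mean to within the precision of the search.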


We have seen how the ML principle, or, for that matter, the analogy 
principle applied to the ZES rule, constructively provides an estimator 
for an unknown parameter when the population family is known. 

Indeed, there is another analogy that leads directly to maximizing
the logarithm of the likelihood function. In the population, $\theta$ can be
characterized as the value for $c$ that maximizes the expectation of the
log-likelihood variable $L = \log f(Y; c)$. The argument runs as follows.
Let

$$D(c) = \log f(Y; c) - \log f(Y; \theta) = \log [f(Y; c)/f(Y; \theta)].$$

Because the logarithm is a concave function, we see by Jensen's Inequality
(Section 3.5) that

$$E[D(c)] \le \log E[f(Y; c)/f(Y; \theta)],$$

with equality if $c = \theta$. But

$$E[f(Y; c)/f(Y; \theta)] = \int [f(y; c)/f(y; \theta)] f(y; \theta)\,dy = \int f(y; c)\,dy = 1,$$

using the fact that $f(y; c)$ is a pdf. So $E[D(c)] \le \log(1) = 0$, with equality
if $c = \theta$. This says that $\theta$ is the value for $c$ that maximizes the population
mean log-likelihood variable. As we have seen, the ML estimator has
the corresponding property in the sample: the ML estimator $T$ is the
value for $c$ that maximizes the sample sum (hence the sample mean) of
the log-likelihood variable.

An advantage of this alternative analogy is that it resolves a choice 
that arises when the FOC has multiple solutions, as may happen in 
nonlinear cases. The choice is resolved by taking the solution that glob- 
ally maximizes the sample mean log-likelihood variable. 
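
The population side of this analogy can be illustrated for a Bernoulli($\theta$) population, where the expected log-likelihood is $\theta \log c + (1 - \theta)\log(1 - c)$. The sketch below is an assumption-laden illustration (the value $\theta = 0.3$, the grid, and the function names are not from the original text). It evaluates $E[D(c)]$ on a grid and confirms that it is never positive and is zero at $c = \theta$.

```python
import math


def expected_loglik(c, theta):
    """E[log f(Y; c)] when Y ~ Bernoulli(theta)."""
    return theta * math.log(c) + (1.0 - theta) * math.log(1.0 - c)


def expected_d(c, theta):
    """E[D(c)] = E[log f(Y; c)] - E[log f(Y; theta)]."""
    return expected_loglik(c, theta) - expected_loglik(theta, theta)


theta = 0.3                                   # hypothetical true parameter
grid = [k / 100.0 for k in range(1, 100)]     # candidate values 0.01, ..., 0.99
values = [expected_d(c, theta) for c in grid]
best_c = grid[values.index(max(values))]
print("maximizing c on the grid:", best_c)
print("largest E[D(c)]:", max(values))
```

On this grid the maximizer is $c = \theta$, where $E[D(c)] = 0$, and every other grid point gives a negative value.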

We restate the properties of ML estimators. Consider random sampling
from a population whose score variable is $Z = \partial \log f(Y; \theta)/\partial\theta$ and
whose information variable is $W = -\partial Z/\partial\theta$.

If $T$ is the ML estimator of $\theta$, then:

$T \xrightarrow{P} \theta$,
$\sqrt{n}(T - \theta) \xrightarrow{D} N(0, \phi^2)$, where $\phi^2 = 1/V(Z) = 1/E(W)$,
$T \overset{A}{\sim} N(\theta, \phi^2/n)$,
$T$ is a BAN estimator of $\theta$.

A convenient property of ML estimation is invariance: If $\alpha = h(\theta)$ is
a monotonic function of $\theta$, and $T$ is the ML estimator of $\theta$, then $A =
h(T)$ is the ML estimator of $\alpha$. Examples: When $\bar{Y}$ is the ML estimator
of $\mu$, then (provided that $\mu \ne 0$) $1/\bar{Y}$ is the ML estimator of $1/\mu$; when
$S^2$ is the ML estimator of $\sigma^2$, then $S$ is the ML estimator of $\sigma$.


Exercises 


12.1 The random variable $X$ has the power distribution on the interval
[0, 1]. That is, the pdf of $X$ is

$$f(x; \theta) = \theta x^{\theta - 1} \quad \text{for } 0 \le x \le 1,$$

with $f(x; \theta) = 0$ elsewhere. The parameter $\theta$ is unknown. Consider
random sampling, sample size $n$.

(a) Show that the maximum likelihood estimator of $\theta$ is $T = 1/\bar{Y}$,
where $Y = -\log X$. (As usual, "log" denotes natural logarithm.)
(b) Find the asymptotic distribution of $T$, in terms of $\theta$ and $n$ only.


12.2 The random variable $Y$ has the exponential distribution with
parameter $\lambda$.

(a) Recalling (from Table 3.1) what is known about $E(Y)$ and $V(Y)$,
calculate $E(Z)$, $V(Z)$, and $E(W)$. Do your results satisfy the rule
$E(W) = V(Z)$?

(b) Complete the sentence: In random sampling, sample size $n$, from
this population, for $T$ to be an unbiased estimator of $\lambda$, its variance
must be greater than or equal to ________.


12.3 The random variable $Y$ has the Poisson distribution with parameter
$\lambda$.

(a) Explain why the sample mean and the sample variance are distinct
analog estimators of $\lambda$.
(b) Determine which of them is the ZES-rule estimator.


12.4 Consider random sampling, sample size $n$, from the
exponential($\lambda$) distribution. Let $T = 1/\bar{Y}$ and $T^* = (n - 1)T/n$.

(a) Find $E(T)$, $V(T)$, $E(T^*)$, and $V(T^*)$.
(b) Comment on your results in the light of the CRI.
(c) Find the MSE of $T$ and $T^*$ as estimators of $\lambda$.


13.1. Introduction


We are now prepared to address systematically the questions raised in 
Chapter 1 about estimating the relation between Y and X in a bivariate 
population. 

By way of review, first consider estimating a population mean in
random sampling from a univariate population. We have learned that
the sample mean $\bar{Y}$ is an unbiased and consistent estimator of the
population mean $\mu$, and that $\bar{Y} \overset{A}{\sim} N(\mu, \sigma^2/n)$. At least two analogies
lead to $\bar{Y}$ as the estimator of $\mu$. First, the population mean is the best
constant predictor of $Y$ in the population: $\mu$ is the value for $c$ that
minimizes $E(U^2)$, where $U = Y - c$. The sample mean has the analogous
property in the sample: $\bar{Y}$ is the value for $c$ that minimizes $\sum_i u_i^2/n$, where
$u_i = y_i - c$. Second, $\mu$ is the value for $c$ that makes $E(U) = 0$ in the
population. The sample mean has the analogous property in the sample:
$\bar{Y}$ is the value for $c$ that makes $\sum_i u_i/n = 0$.

Next consider a bivariate population, in which the population linear
projection is $E^*(Y|X) = \alpha + \beta X$, with

$$\beta = \sigma_{XY}/\sigma_X^2, \qquad \alpha = \mu_Y - \beta\mu_X.$$

In random sampling, consider the sample linear projection, $\hat{Y} = A +
BX$, with

$$B = S_{XY}/S_X^2, \qquad A = \bar{Y} - B\bar{X}.$$

We have learned (Sections 10.5 and 10.6) that the sample slope $B$
consistently estimates $\beta$, and that

$$B \overset{A}{\sim} N(\beta, \phi^2/n),$$

where $\phi^2 = E(X^{*2}U^2)/V^2(X)$, $X^* = X - \mu_X$, and $U = Y - E^*(Y|X)$. A similar result holds
for the intercept $A$. At least two analogies lead to the line $\hat{Y} = A + BX$
as the estimator of $E^*(Y|X) = \alpha + \beta X$. First, the population linear
projection is the best linear predictor of $Y$ given $X$ in the population: it
minimizes $E(U^2)$, where now $U = Y - (a + bX)$. The sample linear
projection has the analogous property in the sample: it minimizes
$\sum_i u_i^2$, where now $u_i = y_i - (a + bx_i)$. This may be referred to as the
least-squares analogy. Second, the population LP is the line such that
$E(U) = 0$ and $E(XU) = 0$ in the population. The sample LP has the
analogous properties in the sample: it makes $\sum_i u_i/n = 0$ and $\sum_i x_iu_i/n =
0$. This may be referred to as the instrumental-variable, or orthogonality,
analogy.

With this background, we can proceed to estimation of the conditional
expectation function $E(Y|X)$. We will suppose that the functional form
of the CEF is known; that is, $E(Y|X) = h(X; \theta)$, where the function
$h(X; \theta)$ is known up to the values of one or more parameters, the
element(s) of the vector $\theta$.


13.2. Estimating a Linear CEF 


Suppose that the population CEF is known to be linear. Then the CEF
coincides with the BLP: that is, $E(Y|X) = \alpha + \beta X$, with $\beta = \sigma_{XY}/\sigma_X^2$ and
$\alpha = \mu_Y - \beta\mu_X$. The two analogies apply, so the natural estimator is
again the sample LP, namely $\hat{Y} = A + BX$, for which the asymptotic
distribution is as above.

In fact, when $E(Y|X)$ is linear, $A$ and $B$ are unbiased estimators of $\alpha$
and $\beta$. We show:


THEOREM. In random sampling from a population in which
$E(Y|X) = \alpha + \beta X$, the sample intercept $A$ and slope $B$ are unbiased
estimators of $\alpha$ and $\beta$.


This is a surprising result because $A$ and $B$ are nonlinear functions of
the sample moments.

Proof. Begin with some algebra:

$$S_X^2 = (1/n)\sum_i (X_i - \bar{X})^2 = \sum_i X_i^2/n - \bar{X}^2 = (1/n)\sum_i (X_i - \bar{X})X_i,$$

$$S_{XY} = (1/n)\sum_i (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_i X_iY_i/n - \bar{X}\bar{Y} = (1/n)\sum_i (X_i - \bar{X})Y_i,$$

$$B = S_{XY}/S_X^2 = \sum_i (X_i - \bar{X})Y_i \Big/ \sum_i (X_i - \bar{X})^2 = \sum_i \Big[(X_i - \bar{X}) \Big/ \sum_h (X_h - \bar{X})^2\Big] Y_i = \sum_i W_iY_i,$$

say, where the random variables

$$W_i = (X_i - \bar{X}) \Big/ \sum_h (X_h - \bar{X})^2 \qquad (i = 1, \ldots, n)$$

are functions only of the $X_i$'s. As a matter of algebra,

$$\sum_i W_i = 0, \qquad \sum_i W_iX_i = 1, \qquad \sum_i W_i^2 = 1 \Big/ \sum_i (X_i - \bar{X})^2.$$

Now condition on a given set of observations on the $X$'s, that is, condition
on $\mathbf{X} = \mathbf{x} = (x_1, \ldots, x_n)'$. Conditional on $\mathbf{X} = \mathbf{x}$, the values of the $W_i$
are constants, which we write as

$$w_i = (x_i - \bar{x}) \Big/ \sum_h (x_h - \bar{x})^2 \qquad (i = 1, \ldots, n).$$

The expectation of the slope $B$ conditional on $\mathbf{x}$ is

$$E(B|\mathbf{x}) = E\Big(\sum_i w_iY_i \Big| \mathbf{x}\Big) = \sum_i w_iE(Y_i|\mathbf{x}) = \sum_i w_iE(Y_i|x_i) = \sum_i w_i(\alpha + \beta x_i) = \alpha\sum_i w_i + \beta\sum_i w_ix_i = \beta,$$

because $\sum_i w_i = 0$ and $\sum_i w_ix_i = 1$ by the algebra above. We have
$E(B|\mathbf{x}) = \beta$ for all $\mathbf{x}$, so $B$ is mean-independent of $\mathbf{X}$, and $E(B) = \beta$.
Similarly, for the intercept $A = \bar{Y} - B\bar{X}$, we have

$$E(A|\mathbf{x}) = E(\bar{Y}|\mathbf{x}) - E(B|\mathbf{x})\bar{x} = (\alpha + \beta\bar{x}) - \beta\bar{x} = \alpha.$$

We have $E(A|\mathbf{x}) = \alpha$ for all $\mathbf{x}$, so $E(A) = \alpha$. ■

There is a subtlety in the derivation, namely the step that equates
$E(Y_i|\mathbf{x})$ to $E(Y_i|x_i) = \alpha + \beta x_i$. This step is justified by the random
sampling assumption. To clarify what is involved, it suffices to show
why $E(Y_1|x_1, x_2) = E(Y_1|x_1)$. Consider the conditional pdf of $Y_1$ given
both $x_1$ and $x_2$:

$$g(y_1|x_1, x_2) = f(y_1, x_1, x_2)/f_0(x_1, x_2).$$

Independence across the observations implies that the trivariate density
in the numerator factors into

$$f(y_1, x_1, x_2) = f(x_1, y_1)f_1(x_2),$$

and, in conjunction with "identically distributed," implies that the
bivariate density in the denominator factors into

$$f_0(x_1, x_2) = f_1(x_1)f_1(x_2).$$

So

$$g(y_1|x_1, x_2) = f(x_1, y_1)/f_1(x_1) = g_2(y_1|x_1).$$

When two distributions are the same, their expectations are the same.
That is,

$$E(Y_1|x_1, x_2) = E(Y_1|x_1).$$

This calculation extends to further conditioning, and of course to $i =
2, \ldots, n$. Thus the step from $E(Y_i|x_i) = \alpha + \beta x_i$ to $E(Y_i|\mathbf{x}) = \alpha + \beta x_i$
is justified under random sampling. Observe how linearity of the CEF
is crucial to the argument.

As for the variance of $B$, under random sampling from a population
with linear CEF, we have:

$$V(B|\mathbf{x}) = V\Big(\sum_i w_iY_i \Big| \mathbf{x}\Big) = \sum_i w_i^2V(Y_i|\mathbf{x}) = \sum_i w_i^2V(Y_i|x_i),$$

and $V(B)$ will equal the expectation (over all $\mathbf{x}$) of those conditional
variances, using the Analysis of Variance formula (T10, Section 5.2),
with $V_X[E(B|X)] = 0$.

A particularly sharp result is obtained if $Y$ is variance-independent of
$X$, that is, if the conditional variance function $V(Y|X)$ is constant. This
will be referred to as the homoskedastic case. If $V(Y|X) = \sigma^2$, say, for all
$X$, then

$$V(B|\mathbf{x}) = \sum_i w_i^2\sigma^2 = \sigma^2\sum_i w_i^2 = \sigma^2 \Big/ \sum_i (x_i - \bar{x})^2 = (\sigma^2/n)(1/s_x^2),$$

where $s_x^2 = \sum_i (x_i - \bar{x})^2/n$ is the sample variance of $X$. We conclude that

$$V(B) = E_X[V(B|X)] = (\sigma^2/n)E(1/S_X^2).$$

For large $n$, this exact variance is approximately $(\sigma^2/n)(1/\sigma_X^2)$, which is
indeed the asymptotic variance of $B$ found in Section 10.6.

In this homoskedastic case, with $\sigma^2$ and $E(1/S_X^2)$ unknown, we get a
standard error for $B$ by taking the square root of $\hat{V}(B) = S^2/(nS_X^2)$, with
$S^2 = \sum_i e_i^2/n$ and $e_i = Y_i - A - BX_i$. It can be shown that $S^2$ and $1/S_X^2$
are consistent for $\sigma^2$ and $1/\sigma_X^2$.
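
The weights used in the unbiasedness proof, and the homoskedastic standard error, are easy to compute directly. The following is a minimal sketch under stated assumptions: the function names and the five data points are made up for illustration and do not come from the original text.

```python
import math


def ls_line(xs, ys):
    """Sample linear projection: returns (A, B, weights), with B = sum of w_i y_i."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ssx = sum((x - x_bar) ** 2 for x in xs)
    w = [(x - x_bar) / ssx for x in xs]           # w_i = (x_i - xbar) / sum_h (x_h - xbar)^2
    b = sum(wi * yi for wi, yi in zip(w, ys))     # slope as a weighted sum of the y_i
    a = y_bar - b * x_bar                         # intercept
    return a, b, w


def slope_standard_error(xs, ys, a, b):
    """Square root of S^2 / (n S_X^2), where S^2 is the mean squared residual."""
    n = len(xs)
    x_bar = sum(xs) / n
    s2 = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / n
    s2_x = sum((x - x_bar) ** 2 for x in xs) / n
    return math.sqrt(s2 / (n * s2_x))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]     # made-up data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, w = ls_line(xs, ys)
se_b = slope_standard_error(xs, ys, a, b)

print("A =", a, " B =", b, " standard error of B =", se_b)
print("sum w_i =", sum(w))
print("sum w_i x_i =", sum(wi * xi for wi, xi in zip(w, xs)))
print("sum w_i^2 =", sum(wi ** 2 for wi in w))
```

The three printed sums reproduce the algebraic identities $\sum_i w_i = 0$, $\sum_i w_ix_i = 1$, and $\sum_i w_i^2 = 1/\sum_i (x_i - \bar{x})^2$, up to rounding.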


13.3. Estimating a Nonlinear CEF 


The preceding theory for estimating linear CEF's applies also to some
nonlinear CEF's. For example, if $E(Y|X) = \alpha + \beta X^2$, then the theory
surely applies to $E(Y|Z) = \alpha + \beta Z$, where $Z = X^2$. Similarly if $E(Y|X) =
\alpha + \beta/X$, then the theory applies to $E(Y|Z) = \alpha + \beta Z$, with $Z = 1/X$.
What is critical, it now appears, is that the CEF be linear in the unknown
parameters $\alpha$, $\beta$.

But suppose that the population CEF is nonlinear in unknown
parameters. For example, suppose that we know $E(Y|X) = \exp(\theta_1 + \theta_2X)$, with
$\theta_1$ and $\theta_2$ unknown. How shall we estimate $\theta_1$ and $\theta_2$?

Again we appeal to the least-squares and instrumental-variable
analogies.

(1) The CEF is the best predictor. In particular, it is the best predictor
of the form $h(X; c_1, c_2) = \exp(c_1 + c_2X)$. In the population, $\theta_1$ and $\theta_2$
are the values for $c_1$ and $c_2$ that minimize $E(U^2)$, where

$$U = Y - \exp(c_1 + c_2X) = Y - h(X; c_1, c_2).$$

So in the sample, let

$$u_i = y_i - \exp(c_1 + c_2x_i),$$

and choose $c_1$, $c_2$ to minimize $(1/n)\sum_i u_i^2$ or, equivalently, to minimize the
criterion $\phi = \phi(c_1, c_2) = \sum_i u_i^2$. The derivatives are

$$\partial\phi/\partial c_1 = 2\sum_i u_i(\partial u_i/\partial c_1) = -2\sum_i u_ih_i,$$

$$\partial\phi/\partial c_2 = 2\sum_i u_i(\partial u_i/\partial c_2) = -2\sum_i u_ih_ix_i,$$

where $h_i = h(x_i; c_1, c_2)$. So the FOC's are

$$\sum_i h_iu_i = 0, \qquad \sum_i h_ix_iu_i = 0.$$

On the proviso that these locate a minimum, we have a pair of nonlinear
equations to be solved for the nonlinear least squares, or NLLS, estimators
of $\theta_1$, $\theta_2$.

(2) The deviations from the CEF have zero expectation and zero
covariance with $X$. That is, let $U = Y - E(Y|X)$; then $E(U) = 0$ and
$E(XU) = 0$. So let us choose as estimates of $\theta_1$, $\theta_2$, the values of $c_1$, $c_2$
that make $\sum_i u_i/n = 0$ and $\sum_i x_iu_i/n = 0$. Equivalently, we choose them to
satisfy

$$\sum_i u_i = 0, \qquad \sum_i x_iu_i = 0.$$

This is a pair of nonlinear equations to be solved for the
instrumental-variable, or IV, estimators of $\theta_1$, $\theta_2$.

In the linear CEF case, where $h(X; c_1, c_2) = c_1 + c_2X$, the two analogies
produce the same estimators, because $\partial u_i/\partial c_1 = -1$ and $\partial u_i/\partial c_2 = -x_i$.
Further, in the linear CEF case we have explicit solutions. In the
nonlinear CEF case, NLLS and IV estimators do not coincide, and further
we will need to rely on numerical solutions. It is not hard to show that
both our analog estimators are consistent (though not unbiased), and
to obtain their asymptotic distributions: the derivation is similar to that
used for the ZES-rule estimator (Section 12.3). Which analog estimator
is preferable may depend on the population family. If the conditional
distributions of $Y$ given $X$ are normal with constant variance (that is, if
$U \sim N(0, \sigma^2)$ independently of $X$), and the marginal distribution of $X$
does not contain the parameters $\theta_1$, $\theta_2$, $\sigma^2$, then it is easy to verify that
the NLLS estimators are also ML, and hence BAN.
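
The text does not prescribe a particular numerical method for the NLLS equations. One common choice is Gauss-Newton iteration, sketched below for the CEF $\exp(c_1 + c_2x)$. This is an illustration under stated assumptions: the algorithm choice, the function names, the starting values, and the six data points (generated from $\exp(0.5 + 0.8x)$ plus small made-up disturbances) are not from the original text.

```python
import math


def foc(c1, c2, xs, ys):
    """NLLS first-order conditions: (sum h_i u_i, sum h_i x_i u_i)."""
    g1 = g2 = 0.0
    for x, y in zip(xs, ys):
        h = math.exp(c1 + c2 * x)
        u = y - h
        g1 += h * u
        g2 += h * x * u
    return g1, g2


def nlls_exponential(xs, ys, c1, c2, iters=50):
    """Gauss-Newton iterations for minimizing sum (y_i - exp(c1 + c2 x_i))^2."""
    for _ in range(iters):
        h = [math.exp(c1 + c2 * x) for x in xs]
        # The derivatives of h_i with respect to (c1, c2) are h_i and h_i x_i.
        j11 = sum(hi * hi for hi in h)
        j12 = sum(hi * hi * x for hi, x in zip(h, xs))
        j22 = sum(hi * hi * x * x for hi, x in zip(h, xs))
        g1, g2 = foc(c1, c2, xs, ys)
        det = j11 * j22 - j12 * j12
        d1 = (j22 * g1 - j12 * g2) / det      # solve the 2 x 2 system for the step
        d2 = (j11 * g2 - j12 * g1) / det
        c1, c2 = c1 + d1, c2 + d2
        if abs(d1) + abs(d2) < 1e-12:
            break
    return c1, c2


xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]                 # made-up regressor values
noise = [0.05, -0.04, 0.03, -0.06, 0.02, -0.01]     # made-up disturbances
ys = [math.exp(0.5 + 0.8 * x) + e for x, e in zip(xs, noise)]

c1_hat, c2_hat = nlls_exponential(xs, ys, c1=0.4, c2=0.85)  # hypothetical starting values
print("NLLS estimates:", c1_hat, c2_hat)
print("FOC values at the estimates:", foc(c1_hat, c2_hat, xs, ys))
```

At convergence both first-order conditions are numerically zero. Gauss-Newton is not guaranteed to converge from a poor starting point, so in practice the starting values matter.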

Observe that NLLS estimation can itself be viewed as a type of IV
estimation. The NLLS FOC's $\sum_i h_iu_i/n = 0$ and $\sum_i h_ix_iu_i/n = 0$ are the
sample analogs of $E[h(X)U] = 0$ and $E[g(X)U] = 0$, where $g(X) = h(X)X$.
Since deviations from a CEF have zero expected cross-product with
every function of $X$ (Section 5.3), such sample analogs are legitimate.


13.4. Estimating a Binary Response Model


To illustrate the opportunities that arise when more is specified about 
the population, we take up a leading example of a nonlinear CEF, 
namely a binary response model, more specifically the probit model. Here 
Y is a binary variable, one that takes on only the values 0 and 1, and 


$$E(Y|X) = F(\theta_1 + \theta_2X),$$


where F(-) is the standard normal cdf. Section 6.3 contains a story that 
leads to this model. 

We consider three estimators for the probit model: nonlinear least 
squares, instrumental variables, and maximum likelihood. 

For NLLS, one minimizes the criterion $\phi = \phi(c_1, c_2) = \sum_i u_i^2$, where

$$u_i = y_i - F_i, \qquad F_i = F(v_i), \qquad v_i = c_1 + c_2x_i.$$

Let $f(\cdot)$ denote the standard normal pdf, so $f_i = f(v_i) = \partial F_i/\partial v_i$. The
derivatives are

$$\partial\phi/\partial c_1 = 2\sum_i u_i(\partial u_i/\partial c_1) = -2\sum_i u_if_i,$$

$$\partial\phi/\partial c_2 = 2\sum_i u_i(\partial u_i/\partial c_2) = -2\sum_i u_if_ix_i.$$

So the FOC's are

$$\sum_i f_iu_i = 0, \qquad \sum_i f_ix_iu_i = 0.$$

On the proviso that these locate a minimum, we have a pair of nonlinear
equations to be solved for the NLLS estimators of $\theta_1$ and $\theta_2$.

For IV estimation, we seek the values of $c_1$, $c_2$ that make $\sum_i u_i/n = 0$
and $\sum_i x_iu_i/n = 0$. Equivalently, we choose them to satisfy

$$\sum_i u_i = 0, \qquad \sum_i x_iu_i = 0.$$

This is a different pair of nonlinear equations to be solved, for the IV
estimators of $\theta_1$, $\theta_2$.

Maximum-likelihood, or ZES-rule, estimation is available because the
probit model automatically specifies the form of the conditional
distribution of $Y$ given $X$. Because $Y$ is a binary variable with $E(Y|X) =
F(\theta_1 + \theta_2X)$, it is clear that conditional on $X$, the variable $Y$ has a
Bernoulli distribution with parameter $F(\theta_1 + \theta_2X)$. Because the
parameters $\theta_1$ and $\theta_2$ do not appear in the marginal distribution for $X$, to
maximize the likelihood it suffices to maximize the conditional
likelihood. Adapting an example in Section 12.2, we see that the conditional
log-likelihood variable is

$$L_i = y_i \log F_i + (1 - y_i) \log(1 - F_i).$$

With two parameters being estimated, there are two score variables:

$$Z_{1i} = \partial L_i/\partial\theta_1 = (y_i/F_i)f_i - [(1 - y_i)/(1 - F_i)]f_i = w_i(y_i - F_i) = w_iu_i,$$

$$Z_{2i} = \partial L_i/\partial\theta_2 = w_ix_i(y_i - F_i) = w_ix_iu_i,$$

say, where $w_i = f_i/[F_i(1 - F_i)]$. Both have expectation zero. So we choose
$c_1$ and $c_2$ to satisfy

$$\sum_i w_iu_i = 0, \qquad \sum_i w_ix_iu_i = 0,$$

which are yet another pair of nonlinear equations to be solved, for the
ML estimators of $\theta_1$, $\theta_2$.

The three estimators are distinct and will differ in any sample. All
three are consistent (though not unbiased) for $\theta_1$ and $\theta_2$. For the probit
model, one can resolve the choice among the estimators by appealing
to the BAN property of ML estimation. More on all this in Section 29.5.
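
As a hedged illustration of how the ML equations $\sum_i w_iu_i = 0$ and $\sum_i w_ix_iu_i = 0$ might be solved, the sketch below applies Fisher scoring, which updates the parameters using the scores and the expected information $\sum_i \{f_i^2/[F_i(1 - F_i)]\}(1, x_i)'(1, x_i)$. The algorithm choice, the function names, and the grouped data set are assumptions made for the example, not part of the original text.

```python
import math


def norm_cdf(v):
    """Standard normal cdf F(v), via the error function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def norm_pdf(v):
    """Standard normal pdf f(v)."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)


def scores(c1, c2, xs, ys):
    """Probit score sums: (sum w_i u_i, sum w_i x_i u_i), w_i = f_i / [F_i (1 - F_i)]."""
    s1 = s2 = 0.0
    for x, y in zip(xs, ys):
        v = c1 + c2 * x
        big_f, small_f = norm_cdf(v), norm_pdf(v)
        w = small_f / (big_f * (1.0 - big_f))
        u = y - big_f
        s1 += w * u
        s2 += w * x * u
    return s1, s2


def probit_ml(xs, ys, iters=100):
    """Fisher scoring for the two-parameter probit model, started at (0, 0)."""
    c1 = c2 = 0.0
    for _ in range(iters):
        s1, s2 = scores(c1, c2, xs, ys)
        i11 = i12 = i22 = 0.0
        for x in xs:
            v = c1 + c2 * x
            big_f, small_f = norm_cdf(v), norm_pdf(v)
            k = small_f * small_f / (big_f * (1.0 - big_f))  # expected information weight
            i11 += k
            i12 += k * x
            i22 += k * x * x
        det = i11 * i22 - i12 * i12
        d1 = (i22 * s1 - i12 * s2) / det
        d2 = (i11 * s2 - i12 * s1) / det
        c1, c2 = c1 + d1, c2 + d2
        if abs(d1) + abs(d2) < 1e-12:
            break
    return c1, c2


# Made-up data: ten observations at each x, with the stated number of ones.
ones_by_x = {-2.0: 1, -1.0: 3, 0.0: 5, 1.0: 7, 2.0: 9}
xs, ys = [], []
for x, n_ones in ones_by_x.items():
    xs += [x] * 10
    ys += [1] * n_ones + [0] * (10 - n_ones)

c1_ml, c2_ml = probit_ml(xs, ys)
print("ML estimates:", c1_ml, c2_ml)
print("scores at the estimates:", scores(c1_ml, c2_ml, xs, ys))
```

Because the made-up data are symmetric about $x = 0$, the estimated intercept is essentially zero. The same `scores` function could be adapted to the NLLS or IV equations by replacing the weight $w_i$ with $f_i$ or with 1.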


13.5. Other Sampling Schemes 


Thus far, we have confined attention to random sampling. But other 
sampling schemes may be relevant in practice. We explore the possibil- 
ities for estimation of the population relation between Y and X, when 
the observations are not randomly drawn from the bivariate population 


$$f(x, y) = g_2(y|x)f_1(x).$$


Selective Sampling 


Suppose that the sampling is explicitly selective on $X$ alone in the sense
that the probability that a particular $(X, Y)$ draw will be retained depends
only on $X$. Let $\psi(X)$ = probability of retention as a function of $X$. Then
the relevant marginal pdf of $X$ is no longer $f_1(x)$ but rather

$$f_1^*(x) = \psi(x)f_1(x) \Big/ \int \psi(s)f_1(s)\,ds.$$

(Note: In this chapter the symbol $\int$ is shorthand for $\int_{-\infty}^{\infty}$.)


Example. For studying the relation of savings $y$ to income $x$, we
might be oversampling high-income households by using $\psi(x) = 0.5$ for
$x \le d$, $\psi(x) = 1.0$ for $x > d$, where $d$ is a prespecified level of income.


By assumption, $g_2(y|x)$ is not affected by this selective sampling scheme,
so the new joint pdf is

$$f^*(x, y) = g_2(y|x)f_1^*(x).$$

If the successive observations are independent, then we are in effect
randomly sampling $(X, Y)$ from a new, selected, population. Because
the conditional pdf $g_2(y|x)$ has not changed, neither has the CEF $E(Y|X)$,
so the theory of the preceding sections applies. Because the two CEF's
are the same, estimators of the new CEF will serve as estimators of the
original CEF.

This argument does not carry over to BLP estimation (unless the CEF
is linear). The explicit-on-$X$ selection produces implicit-on-$Y$ selection.
The marginal pdf of $Y$ changes from $f_2(y)$ to

$$f_2^*(y) = \int f^*(x, y)\,dx = \int g_2(y|x)f_1^*(x)\,dx.$$

Presumably the marginal expectations and variances of both variables,
and their covariance, are different in the selected population. If so, the
BLP $E^*(Y|X)$ is presumably different. Another way to see this is to recall
(T13, Section 5.5) the best linear approximation property of the BLP:
it minimizes $E(W^2)$, where $W = E(Y|X) - (a + bX)$, and the expectation
is taken over the marginal distribution of $X$. Because $f_1^*(x)$ differs from
$f_1(x)$, one should presume that the values of $a$ and $b$ that minimize
$\int w^2f_1^*(x)\,dx$ differ from those that minimize $\int w^2f_1(x)\,dx$. If the two
BLP's are different, then the sample LP, which will consistently estimate
the BLP of the new population, will not serve for the original BLP.

This negative conclusion also applies to the "reverse" CEF, namely
$E(X|Y)$. The new conditional pdf for $X$ given $Y$, namely

$$g_1^*(x|y) = f^*(x, y)/f_2^*(y),$$

presumably differs from $g_1(x|y)$. So the new $E(X|Y)$ presumably differs
from the original one, and the results obtained in sampling from the
selected population are inappropriate for estimating the original
$E(X|Y)$.

We conclude that explicit selection on $X$ alone affects both of the
BLP's, and $E(X|Y)$ as well; these effects are sometimes labeled selection
bias. But it does not affect $E(Y|X)$, so random sampling from the
original joint probability distribution is not needed when a CEF is the target.
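
A small discrete example makes the contrast concrete. The sketch below is illustrative only: the population, the retention probabilities, and the function names are all made up. It builds a joint pmf with a nonlinear CEF, applies selection on $X$ alone, and then compares the CEF and the BLP slope before and after selection.

```python
def select(joint, psi):
    """Apply retention probabilities psi[x] to a joint pmf {(x, y): p} and renormalize."""
    kept = {(x, y): psi[x] * p for (x, y), p in joint.items()}
    total = sum(kept.values())
    return {xy: p / total for xy, p in kept.items()}


def cef(joint):
    """Conditional expectation E(Y | X = x) for each x in the support."""
    out = {}
    for x in sorted({x for x, _ in joint}):
        px = sum(p for (xx, _), p in joint.items() if xx == x)
        out[x] = sum(y * p for (xx, y), p in joint.items() if xx == x) / px
    return out


def blp(joint):
    """Intercept and slope of the best linear predictor of Y given X."""
    ex = sum(x * p for (x, _), p in joint.items())
    ey = sum(y * p for (_, y), p in joint.items())
    exx = sum(x * x * p for (x, _), p in joint.items())
    exy = sum(x * y * p for (x, y), p in joint.items())
    slope = (exy - ex * ey) / (exx - ex ** 2)
    return ey - slope * ex, slope


# Hypothetical population: X in {0, 1, 2}, Y binary, with a nonlinear CEF.
marginal_x = {0: 0.5, 1: 0.3, 2: 0.2}
cef_true = {0: 0.1, 1: 0.2, 2: 0.9}
joint = {}
for x, px in marginal_x.items():
    joint[(x, 1)] = px * cef_true[x]
    joint[(x, 0)] = px * (1.0 - cef_true[x])

psi = {0: 0.5, 1: 0.5, 2: 1.0}   # retention probabilities depend on x only
selected = select(joint, psi)

print("CEF, original:", cef(joint))
print("CEF, selected:", cef(selected))
print("BLP slope, original:", blp(joint)[1])
print("BLP slope, selected:", blp(selected)[1])
```

In this example the two CEF's coincide at every $x$, while the BLP slopes differ (roughly 0.36 in the original population against 0.39 in the selected one).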


Varying Marginals 


Another departure from random sampling arises if the marginal
distribution of $X$ changes from observation to observation. There is no
longer a single bivariate population from which the sample is drawn. If
the observations are independent, the joint pdf of the sample $x$'s
becomes

$$g(x_1, \ldots, x_n) = \prod_i f_{1i}(x_i),$$

where the $f_{1i}(\cdot)$ functions vary over $i$. Provided that the conditional pdf
$g_2(y|x)$ is the same at all observations, then the CEF $E(Y|X)$ will remain
the same at each observation. If so, then least squares remains
appropriate, and is indeed unbiased if the CEF is linear.

Since there is no longer a single bivariate population, best linear
prediction in the population is not well-defined, unless one uses
$\sum_i f_{1i}(x)/n$, say, as a marginal pmf for $X$. To assess asymptotics, one needs
further specification: how do the $f_{1i}(\cdot)$'s develop as $n$ grows?


Nonstochastic Explanatory Variable 


An extreme special case of the varying-marginals scheme arises if at
each observation the marginal distribution of $X$ is degenerate, that is to
say, $X_1, \ldots, X_n$ are constants (not all equal to each other, of course). In
the econometric literature, this is known as the nonstochastic (or
nonrandom, or fixed) explanatory variable case. Another description is
stratified sampling, with the values of $X$ defining the strata, or subpopulations.
If the CEF is linear, and the observations are independent, then least
squares estimation is unbiased. To verify this, return to Section 13.2,
and utilize the fact that there is only one possible value of the vector
$\mathbf{X} = (x_1, \ldots, x_n)'$, so the conditioning can be suppressed.

There is no longer a single bivariate population, but a population
BLP might be defined by using the empirical frequency distribution of
the $n$ values $x_i$ as a marginal pmf for $X$. To consider asymptotic
properties, some supplementary information is needed: how does the $x_i$ series
develop as $n$ grows?

It is sometimes said that the nonstochastic explanatory variable case
requires that the researcher "controls," "sets," or "manipulates" the
values of the conditioning variable. This is ambiguous or misleading.
For example, if $X$ denotes gender, a researcher may decide to collect a
sample consisting of 50 men, followed by 25 women. If so, the sample
values of $X$ are nonstochastic as required, but the researcher has not
controlled, set, or manipulated the gender of any individual in the
population.


Exercises 


13.1 Consider a random sample of size $n$ from the joint distribution
of $(X, Y)$, where $Y|X$ is Bernoulli with $E(Y|X) = F(\theta X)$, with $F(\cdot)$ being
the $N(0, 1)$ cdf. Determine whether the following statement is true or
false: The ZES-rule estimator of $\theta$ is the value for $c$ that satisfies
$\sum_i [y_i - F(cx_i)] = 0$.


13.2 Consider a random sample of size $n$ from the joint distribution
of $(X, Y)$, where $Y|X$ is exponential with parameter $\lambda = 1/\exp(\theta X)$.
Determine whether the following statement is true or false: The
maximum likelihood estimator of $\theta$ is the value $c$ that minimizes
$\sum_i [y_i - \exp(cx_i)]^2$.


13.3 In Exercise 5.8, we introduced the best proportional predictor,
or BPP, of $Y$ given $X$. It is $E^{**}(Y|X) = \gamma X$, where $\gamma = E(XY)/E(X^2)$.
For estimating $\gamma$, the analogy principle suggests the statistic
$T = \sum_i X_iY_i/\sum_i X_i^2$. Assume random sampling.

(a) Show that $T$ is a consistent estimator of $\gamma$.
(b) Show that $T \overset{A}{\sim} N(\gamma, \phi^2/n)$, where

$$\phi^2 = E(U^2X^2)/E^2(X^2) = \phi_2/\phi_1^2,$$

say, with

$$U = Y - \gamma X,$$

$$\phi_1 = E(X^2),$$

$$\phi_2 = E(U^2X^2) = E[(Y^2 - 2\gamma XY + \gamma^2X^2)X^2] = E(X^2Y^2) - 2\gamma E(X^3Y) + \gamma^2E(X^4).$$

(c) Propose a consistent estimator of $\phi^2$, for use in constructing an
approximate confidence interval for $\gamma$.


13.4 Continuing Exercise 13.3, suppose that we have a random
sample of size $n = 100$ from the joint distribution of $(X, Y)$. In the
population, $X$ takes on only the values 1, 2, 3, 4, 5, and $Y$ takes on only
the values 0, 1. The sample observations are

                 X
    Y      1    2    3    4    5
    0     11    2    5    9    2
    1     21    7    9   18   16

Construct an approximate 95% confidence interval for $\gamma$, the slope of
the BPP of $Y$ given $X$.


14.1. Population Regression Function


In most economic contexts, the relation of interest involves more than 
two variables. Economists might consider how the output of a firm is 
related to its inputs of labor, capital, and raw materials, or how the 
earnings of a worker are related to her age, education, race, region of 
residence, and years of work experience. So we will move from simple 
regression (one conditioning variable) to multiple regression (several con- 
ditioning variables). 

The setting for this is a multivariate population in which the $k$-variate
random vector $(Y, X_2, \ldots, X_k)$ has joint pdf (or pmf) $f(y, x_2, \ldots, x_k)$.
The conditional probability distribution of $Y$ given $X_2, \ldots, X_k$ is
described by the conditional pdf (or pmf)

$$g(y|x_2, \ldots, x_k) = f(y, x_2, \ldots, x_k)/f_1(x_2, \ldots, x_k).$$

Here $f_1(x_2, \ldots, x_k)$ is the "joint-marginal" pdf (or pmf) of $X_2, \ldots, X_k$:
it is "marginal" in that it is integrated (or summed) over one variable,
but "joint" in that it still refers to several variables.

The conditional expectation function, or population regression
function, of $Y$ given $X_2, \ldots, X_k$ is

$$E(Y|X_2, \ldots, X_k) = \int y\,g(y|x_2, \ldots, x_k)\,dy.$$

The CEF traces out the path of the conditional means of $Y$ across
subpopulations defined by the values of the $X$'s. As in the bivariate case
(Sections 5.3 and 5.4), the CEF has some distinctive characteristics in
the population.

(1) The CEF is the best predictor of $Y$ given the $X$'s. That is, let $U =
Y - h(\ )$, where $h(\ ) = h(X_2, \ldots, X_k)$ is any function of the $X$'s; to
minimize $E(U^2)$, choose $h(\ ) = E(Y|X_2, \ldots, X_k)$.

(2) The deviation from the CEF has zero expectation and is
mean-independent of all of the $X$'s. That is, let $\varepsilon = Y - E(Y|X_2, \ldots, X_k)$;
then $E(\varepsilon|X_2, \ldots, X_k) = 0$. It follows by the Law of Iterated Expectations
(T8, Section 5.2), that $E(\varepsilon) = 0$ and $E(\varepsilon|X_j) = 0$ for $j = 2, \ldots, k$. And
so it follows that $\varepsilon$ is uncorrelated with every function of the $X$'s.

There is another feature of interest in any multivariate probability
distribution, namely the population linear projection, or BLP, of $Y$ on
$X_2, \ldots, X_k$:

$$E^*(Y|X_2, \ldots, X_k) = \beta_1 + \beta_2X_2 + \cdots + \beta_kX_k,$$

where the $\beta_j$'s are chosen to minimize expected squared deviations of $Y$.
That is, let $U = Y - h(\ )$, where $h(\ )$ is any linear function of the $X$'s.
If we choose $h(\ )$ to minimize $E(U^2)$, the solution is $E^*(Y|X_2, \ldots, X_k)$.
More explicitly, write $U = Y - (c_1 + \sum_{j=2}^k c_jX_j)$. Then

$$\partial E(U^2)/\partial c_1 = -2E(U),$$

$$\partial E(U^2)/\partial c_j = -2E(X_jU) \qquad (j = 2, \ldots, k).$$

Equating these $k$ derivatives to zero to locate the minimum gives the
first-order conditions

$$E(U) = 0, \qquad E(X_jU) = 0 \qquad (j = 2, \ldots, k),$$

which taken together are equivalent to

$$E(U) = C(X_2, U) = \cdots = C(X_k, U) = 0.$$

Substituting for $U$ gives the system of $k$ linear equations that determine
the values of the $k$ $\beta$'s in terms of population means, variances, and
covariances of $Y, X_2, \ldots, X_k$.

As in the bivariate case (Sections 5.4 and 5.5), the BLP has some
distinctive characteristics in the population.

(1) The BLP is the best linear approximation to the CEF. That is, let
$W = r(\ ) - h(\ )$, where $r(\ )$ is the CEF and $h(\ )$ is any linear function
of the $X$'s; to minimize $E(W^2)$, choose $h(\ ) = E^*(Y|X_2, \ldots, X_k)$.

(2) The deviation from the BLP has zero expectation and is
uncorrelated with each of the $X$'s, as shown in the FOC's.

As in the bivariate case, we draw on the analogy principle to suggest
an estimator of the BLP, or equivalently of a linear CEF. Suppose that
we have a sample of $n$ observations from the multivariate population.
These data take the form

$$\begin{array}{cccccc}
y_1 & x_{12} & \cdots & x_{1j} & \cdots & x_{1k} \\
\vdots & \vdots & & \vdots & & \vdots \\
y_i & x_{i2} & \cdots & x_{ij} & \cdots & x_{ik} \\
\vdots & \vdots & & \vdots & & \vdots \\
y_n & x_{n2} & \cdots & x_{nj} & \cdots & x_{nk}
\end{array}$$

The first subscript indexes the observations ($i = 1, \ldots, n$); the second
subscript indexes the conditioning variables ($j = 2, \ldots, k$). The aim is
to process these data to get estimates of the BLP parameters, the $\beta$'s.
The least-squares analogy suggests that we take as the estimates of the
$\beta$'s, the values for the $c$'s that minimize the criterion

$$\phi = \phi(c_1, c_2, \ldots, c_k) = \sum_i u_i^2,$$

where

$$u_i = y_i - (c_1 + c_2x_{i2} + \cdots + c_kx_{ik}).$$

Solving that problem may be referred to as "running the LS linear
regression" of $Y$ on $X_2, \ldots, X_k$.

This minimization problem is a purely algebraic one that can be posed
without reference to population CEF's or BLP's: find the best-fitting
line in a body of data, where fit is measured in terms of sum of squared
sample deviations. In the remainder of this chapter, we explore LS
linear regression in isolation from its probability setting.


14.2. Algebra for Multiple Regression 


It is convenient to define a variable X_1, called "the constant," that is 
equal to 1 at all observations, and thus to add a column with elements 
x_i1 = 1 to the display of the data above. Given the n observations on Y, 
X_1, ..., X_k, the criterion to be minimized is φ = Σ_i u_i², where 

u_i = y_i − (x_i1 c_1 + x_i2 c_2 + ... + x_ik c_k). 

Differentiating with respect to the c_j (for j = 1, ..., k) gives 

∂φ/∂c_j = Σ_i (∂u_i²/∂c_j) = Σ_i 2u_i(∂u_i/∂c_j) = 2 Σ_i u_i(−x_ij) 
        = −2 Σ_i x_ij u_i. 

So the first-order conditions for the minimum are 

Σ_i x_ij u_i = 0   (j = 1, ..., k). 

This is a system of k linear equations in c_1, ..., c_k. 


At this point, it is convenient to adopt a vector notation. Define the 
n X 1 vectors 


y = {y_i},   x_1 = {x_i1},   ...,   x_k = {x_ik}, 

and the n × 1 vector 

u = y − (x_1 c_1 + ... + x_k c_k) = {u_i}. 


(Note the convention: Typical elements of vectors are identified by curly 
brackets.) Then the criterion may be written as 

φ = u'u = φ(c_1, ..., c_k), 

and the FOC's may be written as 

x_j'u = 0   (j = 1, ..., k). 

A matrix formulation is even more convenient. Define the n × k matrix 

X = (x_1, ..., x_k), 

and the k × 1 vector c = (c_1, ..., c_k)'. Then the criterion may be written 
as 

φ = u'u = φ(c), 

where 

u = y − Xc, 

and the FOC's may be written as 

X'u = 0. 


Let c* denote a solution value for c, that is, X'(y − Xc*) = 0 or, 
equivalently, 

X'Xc* = X'y. 

This system of k linear equations in the k elements of c* is known as 
the set of normal equations for LS linear regression. Here Q = X'X = 
{x_h'x_j} is the k × k symmetric matrix of sums of squares and cross- 
products of the explanatory variables, while X'y = {x_j'y} is the k × 1 
vector of sums of cross-products of the explanatory variables with the 
dependent variable. 

Two cases arise when we consider solving the normal equations. Case 
1, the full-rank case, holds when the k × k matrix Q is nonsingular, 
equivalently when Q is invertible, has rank k, has determinant |Q| ≠ 0. 
Case 2, the short-rank case, holds when Q is singular, equivalently when 
Q is not invertible, has rank less than k, has determinant |Q| = 0. 

In Case 1, the normal equations Qc* = X'y have a unique solution, 
which we denote as 

b = Q⁻¹X'y. 

The claim is that b is the unique minimizer of the criterion: that is, 
φ(b) < φ(c) for all c ≠ b. In Case 2, the normal equations Qc* = X'y 
have an infinity of solutions, none of which is expressible in terms of 
the inverse of Q. There the claim is weaker, namely that if c* is a 
solution, then φ(c*) ≤ φ(c) for all c, with equality iff c also satisfies the 
normal equations. Both claims will be verified in Section 14.5. 

Confining attention to the full-rank case, we introduce some termi- 
nology and notation. The least-squares coefficient vector is 

b = Q⁻¹X'y = Ay, 

where A = Q⁻¹X' is k × n. The fitted-value vector is 

ŷ = Xb = XAy = Ny, 

where N = XA = XQ⁻¹X' is n × n. The residual vector is 

e = y − ŷ = Iy − Ny = (I − N)y = My, 

where M = I − N = I − XA = I − XQ⁻¹X' is n × n. The minimized 
value of the criterion, the sum of squared residuals from the least-squares 
line, is φ(b) = e'e. 

Here are some easily verified properties of the Q, A, N, and M 
matrices that prove useful in the sequel: 

(Q⁻¹)' = Q⁻¹, 
AX = (Q⁻¹X')X = Q⁻¹Q = I, 
AA' = (Q⁻¹X')(XQ⁻¹) = Q⁻¹QQ⁻¹ = Q⁻¹, 
N' = A'X' = (XQ⁻¹)X' = XQ⁻¹X' = N, 
NN = (XA)(XA) = X(AX)A = XIA = XA = N, 
M' = I − N' = I − N = M, 
MM = (I − N)(I − N) = I − N − N + NN = I − N = M, 
NX = (XA)X = X(AX) = XI = X, 
MX = (I − N)X = X − NX = X − X = O. 

Observe that the k × k matrix Q⁻¹ is symmetric, while the n × n matrices 
M and N are idempotent. (A square matrix T is said to be idempotent iff 
T = T' and TT = T.) 

To recapitulate the algebra of multiple regression: we are given the 
data y and X = (x_1, ..., x_k), and asked to find the linear combination 
of the columns of X that comes closest to y in the least-squares sense. 
Then, provided that Q = X'X is nonsingular, the vector b = Ay is the 
unique coefficient vector that solves the minimization problem, the 
vector ŷ = Ny gives the values of the linear combination, and the vector 
e = My gives the residuals. 
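
The recapitulation above can be traced on a small numerical case. The following is a minimal sketch, not part of the original text: the four observations and the helper names (`transpose`, `matmul`, `inverse_2x2`) are hypothetical, and exact fractions are used so that the identities X'e = 0, NN = N, and MM = M hold exactly rather than up to rounding.

```python
from fractions import Fraction as F

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(Q):
    (a, b), (c, d) = Q
    det = a * d - b * c          # nonzero in the full-rank case
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical data: n = 4 observations, k = 2 (the constant and one regressor).
X = [[F(1), F(x)] for x in (1, 2, 3, 4)]
y = [[F(v)] for v in (2, 3, 5, 4)]
n = len(X)

Q = matmul(transpose(X), X)                  # Q = X'X
A = matmul(inverse_2x2(Q), transpose(X))     # A = Q^{-1} X'
b = matmul(A, y)                             # b = Ay, the LS coefficients
N = matmul(X, A)                             # N = XA
M = [[F(int(i == j)) - N[i][j] for j in range(n)] for i in range(n)]  # M = I - N
y_hat = matmul(N, y)                         # fitted values
e = matmul(M, y)                             # residuals

print("b =", [str(v[0]) for v in b])
print("X'e =", [str(v[0]) for v in matmul(transpose(X), e)])
```

With a larger k one would replace the hypothetical `inverse_2x2` by a general linear solver; the 2 × 2 closed form is used here only to keep the sketch short.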


14.3. Ranks of X and Q 

In discussing the normal equations, we distinguished two cases with 
respect to the k × k symmetric matrix Q = X'X: the full-rank case where 
rank(Q) = k, and the short-rank case where rank(Q) < k. We now show 
that these are equivalently described in terms of the n × k matrix X: 
the full-rank case has rank(X) = k, the short-rank case has rank(X) < k. 

Let d = (d_1, ..., d_k)' be any k × 1 vector. Then the n × 1 vector 

v = Xd = x_1 d_1 + ... + x_k d_k 

is a linear combination of the columns of X, and the scalar 

v'v = (d'X')(Xd) = d'X'Xd = d'Qd 

is a sum of squares. 

Suppose that the rank of X is less than k, the number of its columns. 
This means that there is a nontrivial linear combination of the columns 
of X that equals the zero vector. That is to say, there is a k × 1 vector 
d ≠ 0 such that v = Xd = 0. For that same d, we have Qd = X'v = 0, 
so that there is a nontrivial linear combination of the columns of Q that 
equals the zero vector. That is to say, the rank of Q is less than k, the 
number of its columns. Conversely, suppose that the rank of Q is less 
than the number of its columns. That is to say, there is a vector d ≠ 0 
such that Qd = 0. For that same d, let v = Xd. Then v'v = d'Qd = 
d'0 = 0, which means that v = 0 (because a sum of squares is zero iff 
all its elements are zero). That is to say, the rank of X is less than the 
number of its columns. 

We have shown that rank(Q) < k ⇔ rank(X) < k, and that rank(Q) = 
k ⇔ rank(X) = k. (In fact it can be shown that rank(Q) = rank(X).) 
Therefore, the two cases may be restated as 


Case 1. Full-rank case: rank(X) = k. 
Case 2. Short-rank case: rank(X) < k. 


This description in terms of X is more useful; it permits us to think 
directly about data situations in which the short-rank case occurs. 


14.4. The Short-Rank Case 


The rank of a matrix cannot exceed the number of its rows or the 
number of its columns. So the short-rank case is guaranteed to arise 
when n < k, that is, when the number of observations is less than the 
number of explanatory variables. But the short-rank case may arise 
even when n ≥ k. That will happen when one of the x_j's is an exact 
linear function of the others. For example, suppose that with n = 100, 
k = 4, it happens that x_4 = x_1 + x_2. Then the nonzero vector d = 
(1, 1, 0, −1)' will satisfy Xd = 0, so rank(X) < 4. 
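
A scaled-down version of this example can be checked directly. The sketch below is illustrative only: it uses five hypothetical observations rather than 100, builds a fourth column equal to the sum of the first two, and confirms that d = (1, 1, 0, −1)' gives Xd = 0 and hence Qd = 0.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical values of (x_2, x_3) for n = 5 observations.
rows = [(2, 5), (4, 1), (3, 3), (5, 2), (2, 7)]

# Columns: x_1 = 1 (the constant), x_2, x_3, and x_4 = x_1 + x_2.
X = [[1, x2, x3, 1 + x2] for x2, x3 in rows]

d = [[1], [1], [0], [-1]]        # nonzero vector with Xd = 0
Xd = matmul(X, d)
Q = matmul(transpose(X), X)      # Q = X'X
Qd = matmul(Q, d)                # equals X'(Xd), so it is also the zero vector

print("Xd =", [v[0] for v in Xd])
print("Qd =", [v[0] for v in Qd])
```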

Why does rank(X) < k rule out a unique solution to the normal equa- 
tions Qc = X'y? From a mechanical point of view: suppose that Qc* = 
X'y and that d ≠ 0 satisfies Xd = 0 and hence Qd = 0. Let c** = 
c* + d. Then 


Qc** = Q(c* + d) = Qc* + Qd = Qc* = X'y, 


so c**, which is different from c*, also satisfies the normal equations. 
From a more fundamental point of view: the minimization problem 
seeks the coefficient vector c such that the linear combination Xc = 
x_1 c_1 + ... + x_k c_k comes closest to the observed vector y in the least- 
squares sense. But if rank(X) < k, then there is a nonzero vector d, 
which when added to c, gives a different set of coefficients (c + d) that 
generate the very same linear combination: X(c + d) = Xc. So the same 
best-fitting linear combination is expressible in different ways. 

In the short-rank case, how does one locate a solution to the normal 
equations? Suppose that rank(X) = k* < k, and without loss of generality 
suppose that rank(X*) = k*, where X* consists of the first k* columns 
of X. Let Q* = X*'X*, and let b* = Q*⁻¹X*'y. Then c* = (b*', 0')', 
where the 0 is (k − k*) × 1, solves the original normal equations. In 
words, run the LS linear regression of y on X* (a full-rank case) and 
assign zero values to the coefficients on the remaining columns of X. 
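
The recipe just described can be sketched numerically. This is an illustration under assumed data, not the author's computation: X has k = 3 columns with the third equal to the sum of the first two, so k* = 2. The sketch regresses y on the first two columns, pads the coefficient vector with a zero, and checks that both c* and c* + d (with Xd = 0) satisfy the normal equations and give the same fitted values.

```python
from fractions import Fraction as F

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(Q):
    (a, b), (c, d) = Q
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical short-rank data: third column = first column + second column.
X = [[F(1), F(x), F(1 + x)] for x in (1, 2, 3, 4)]
y = [[F(v)] for v in (2, 3, 5, 4)]

X_star = [row[:2] for row in X]                      # first k* = 2 columns
Q_star = matmul(transpose(X_star), X_star)
b_star = matmul(matmul(inverse_2x2(Q_star), transpose(X_star)), y)
c_star = b_star + [[F(0)]]                           # c* = (b*', 0)'

Q = matmul(transpose(X), X)                          # singular 3 x 3 matrix
Xy = matmul(transpose(X), y)

d = [[F(1)], [F(1)], [F(-1)]]                        # Xd = 0
c_alt = [[c_star[j][0] + d[j][0]] for j in range(3)] # another solution, c* + d

print("c*     =", [str(v[0]) for v in c_star])
print("c* + d =", [str(v[0]) for v in c_alt])
```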
We now verify that the FOC's locate the minimum of φ(c) = u'u, where 
u = y − Xc. Let c* solve the FOC's, so X'u* = 0 with u* = y − Xc*. 
For any c, let d = c − c*; then 

u = y − Xc = y − X(c* + d) = y − Xc* − Xd = u* − Xd. 

Because X'u* = 0, we have 

φ(c) = u'u = u*'u* + d'X'Xd = φ(c*) + v'v, 

say, where v = Xd. Because v'v is a sum of squares, we know that v'v ≥ 
0, with equality iff v = 0. It follows that φ(c) ≥ φ(c*), with equality iff 
v = 0, that is iff Xd = 0, that is iff Qd = 0, that is iff c also solves the 
FOC's. 

If rank(X) = k, then the only vector d that satisfies Xd = 0 is the zero 
vector, so φ(c) = φ(c*) iff d = 0, that is, iff c = b. This verifies the claim 
that in the full-rank case, b uniquely minimizes φ: φ(c) > φ(b) for all 
c ≠ b. If rank(X) < k, there are many vectors that solve the FOC's: they 
differ from one another by vectors d ≠ 0 that satisfy Xd = 0. This 
verifies the claim that in the short-rank case, φ(c) ≥ φ(c*) with equality 
iff c also solves the FOC's. 

We can draw some other implications from the fact that d'Qd = v'v 
is a sum of squares. Recall the matrix-algebra concept of definiteness. 
A square symmetric matrix T is nonnegative definite iff for every d, the 
scalar d'Td is nonnegative, and is positive definite iff for every d ≠ 0, the 
scalar d'Td is positive. If the sign of d'Td depends on d, then T is said 
to be indefinite. Now a sum of squares is nonnegative, and is zero iff all 
of its elements are zero. We conclude that Q = X'X is nonnegative 
definite, and further that Q is positive definite iff Xd = 0 implies d = 
0, that is iff rank(X) = k. 


Exercises 


14.1 Let X and y be 

        1  2            14 
        1  4            17 
    X = 1  3 ,    y =    8 . 
        1  5            16 
        1  2             3 


Calculate the following, using fractions to maintain precision. Feel free 
to factor out a common denominator in displaying a matrix. 


(a) Q = X'X, |X'X|, Q⁻¹. 
(b) A = Q⁻¹X', b = Ay. 
(c) N = XA, ŷ = Ny. 
(d) M = I − N, e = My. 
(e) tr(N), tr(M). 


(Note: If T is a square matrix, then tr(T) = trace(T) = sum of diagonal 
elements of T.) 


14.2 For a certain data set with n = 100 observations, the explanatory 
variables include x_1 = 1, x_2 = a binary variable that is equal to 1 for 
males and equal to zero for females, and x_3 = a binary variable that is 
equal to 1 for females and equal to zero for males. Will the X matrix 
have full column rank? Explain. 


14.3 Let X be an n × k matrix whose rank is k, and let 

Q = X'X,   A = Q⁻¹X',   N = XA,   M = I − N. 

Recall that 

b = Ay,   ŷ = Ny,   e = My, 

are the vectors of coefficients, fitted values, and residuals that result 
when an n × 1 vector y is linearly regressed on X. Show the following, 
as concisely as possible. 

(a) AN = A, AM = O, MN = O, NM = O. 

(b) NX = X, MX = O. 

(c) Nŷ = ŷ, Ne = 0. 

(d) Mŷ = 0, Me = e. 

(e) X'ŷ = X'y. 

(f) y'ŷ = y'Xb = b'X'y = b'Qb = ŷ'ŷ. 

(g) e'e = y'My = y'y − ŷ'ŷ. 

14.4 Show that every idempotent matrix is nonnegative definite. 


14.5 Let m_ii and n_ii denote the ith diagonal elements of M and N. 
Show that 0 ≤ m_ii ≤ 1 and 0 ≤ n_ii ≤ 1. 


15 Classical Regression 


15.1. Matrix Algebra for Random Variables 


In this chapter we will establish a model for multiple regression, that 
is, a population specification and sampling scheme that support running 
LS linear regression to estimate population parameters. In preparation, 
we develop a general matrix-algebra system for dealing with random 
variables. 


Setting aside the regression application, let Y_1, ..., Y_n be a set of n 
random variables whose joint pdf (or pmf) is f(y_1, ..., y_n). The 
expectations, variances, and covariances are (for i, h = 1, ..., n): 

E(Y_i) = μ_i,   V(Y_i) = σ_i² = σ_ii,   C(Y_h, Y_i) = σ_hi = σ_ih. 


It is natural to display these in an n × 1 vector Y, an n × 1 vector μ, 
and an n × n matrix Σ, where: 

        Y_1           μ_1           σ_11 ... σ_1n 
    Y =  .   ,   μ =   .   ,   Σ =    .         .    . 
        Y_n           μ_n           σ_n1 ... σ_nn 



At the risk of some confusion we adopt the matrix-algebra convention of 
lowercase characters for vectors, overriding the statistical convention of 
uppercase characters for random variables, and write the n × 1 random 
vector and its elements as 

y = (y_1, ..., y_n)'. 


The expectation of a random vector (or matrix) is defined to be the vector 
(or matrix) of expectations of its elements. The variance matrix of a 
random vector is defined to be the matrix of variances and covariances 
of its elements. We write 

E(y) = μ,   V(y) = Σ. 


When y is n × 1, then μ is n × 1, and Σ is n × n symmetric. 
Let ε = y − μ = {y_i − μ_i} be the n × 1 vector of deviations of the y's 
from their respective expectations. So 

εε' = (y − μ)(y − μ)' = {(y_h − μ_h)(y_i − μ_i)} 

is an n × n symmetric random matrix whose elements are the squares 
and cross-products of those deviations. Then 

E(ε) = E(y − μ) = {E(y_i) − μ_i} = {μ_i − μ_i} = {0} = 0, 
E(εε') = E[(y − μ)(y − μ)'] = {σ_hi} = Σ = V(y) = V(ε). 


The covariance matrix of a pair of random vectors is defined to be the 
matrix of covariances between the elements of one vector and the ele- 
ments of the other vector. Thus if z = {z_h} is an m × 1 random vector 
and y = {y_i} is an n × 1 random vector, then 

C(z, y) = E{[z − E(z)][y − E(y)]'} 

is the m × n matrix whose (h, i)th element is C(z_h, y_i), while C(y, z) is 
the n × m transpose of that matrix. 

Here is a set of rules for calculating expectations, variances, and 
covariances of certain functions of y. Throughout we suppose that the 
n × 1 random vector y has expectation vector E(y) = μ and variance 
matrix V(y) = Σ, and write ε = y − μ. The first two rules, which refer 
to linear functions, are straightforward generalizations of T5 and T6 
in Section 5.1. 

R1. SCALAR LINEAR FUNCTION. Let z = g + h'y, where the 
scalar g and the n × 1 vector h are constants. Then the random variable 
z has 

E(z) = g + h'E(y) = g + h'μ. 

Further, let z* = z − E(z). Then z* = h'y − h'μ = h'(y − μ) = h'ε, 
and z*² = (h'ε)² = (h'ε)(h'ε) = (h'ε)(ε'h) = h'εε'h. So 

V(z) = E(z*²) = E(h'εε'h) = h'E(εε')h = h'V(ε)h = h'Σh. 

Incidentally, since the scalar variance V(z) must be nonnegative, we see 
that every variance matrix Σ is nonnegative definite, and is positive 
definite iff it is nonsingular. 


R2. VECTOR LINEAR FUNCTION. Let z = g + Hy, where the 
k × 1 vector g and the k × n matrix H are constants. Then the k × 1 
random vector z has 

E(z) = g + Hμ. 

Further, let z* = z − E(z). Then z* = H(y − μ) = Hε, and z*z*' = 
Hεε'H'. So 

V(z) = E(z*z*') = E(Hεε'H') = HE(εε')H' = HΣH'. 
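
Rule R2 reduces to two matrix products, which the following sketch carries out. The numbers for μ, Σ, g, and H are hypothetical, chosen only so that the result can be checked by hand against the scalar variance and covariance rules (for instance, V(y_1 + y_2) = σ_11 + σ_22 + 2σ_12).

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical moments of a 3 x 1 random vector y.
mu = [[1], [2], [3]]
Sigma = [[2, 1, 0],
         [1, 3, 1],
         [0, 1, 4]]

# z = g + Hy with z_1 = 1 + y_1 + y_2 and z_2 = y_2 - y_3.
g = [[1], [0]]
H = [[1, 1, 0],
     [0, 1, -1]]

H_mu = matmul(H, mu)
E_z = [[g[i][0] + H_mu[i][0]] for i in range(len(g))]   # E(z) = g + H mu
V_z = matmul(matmul(H, Sigma), transpose(H))            # V(z) = H Sigma H'

print("E(z) =", E_z)
print("V(z) =", V_z)
```

The constant vector g shifts the expectation but drops out of the variance matrix, as the rule states.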


R3. MEAN SQUARES. Let W = yy'. Then the n × n random 
matrix W has expectation E(W) = Σ + μμ'. 

Proof. Write 

yy' = (μ + ε)(μ + ε)' = μμ' + με' + εμ' + εε', 

which, since μ is constant and E(ε) = 0, implies 

E(yy') = μμ' + Σ. ∎ 


R4. SUM OF SQUARES. Let w = y'y. Then the scalar random 
variable w has expectation E(w) = tr(Σ) + μ'μ. 

Proof. Write 

y'y = tr(y'y) = tr(yy') = tr(W), 

so 

E(y'y) = E[tr(W)] = tr[E(W)] = tr(Σ + μμ') 
       = tr(Σ) + tr(μμ') = tr(Σ) + tr(μ'μ) = tr(Σ) + μ'μ, 

using the facts that trace is a linear operator, and that if AB and BA 
are both square matrices, then tr(AB) = tr(BA). ∎ 

R5. QUADRATIC FORM. Let w = y'Ty, where the n × n matrix 
T is constant. Then the random variable w has expectation E(w) = 
tr(TΣ) + μ'Tμ. 

Proof. Write y'Ty = tr(y'Ty) = tr(Tyy') = tr(TW). Then 

E(y'Ty) = E[tr(TW)] = tr[E(TW)] = tr[TE(W)] 
        = tr[T(Σ + μμ')] = tr(TΣ) + tr(Tμμ') 
        = tr(TΣ) + μ'Tμ. ∎ 

R6. PAIR OF VECTOR LINEAR FUNCTIONS. Let z_1 = g_1 + 
H_1y, z_2 = g_2 + H_2y, where the m_1 × 1 vector g_1, the m_2 × 1 vector g_2, 
the m_1 × n matrix H_1, and the m_2 × n matrix H_2 are constants. Then 
C(z_1, z_2) = H_1ΣH_2'. 

Proof. Let z_1* = z_1 − E(z_1) = H_1ε, and z_2* = z_2 − E(z_2) = H_2ε. Then 
z_1*z_2*' = H_1εε'H_2', so 

C(z_1, z_2) = E(z_1*z_2*') = H_1E(εε')H_2' = H_1ΣH_2'. ∎ 
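
Rule R5 can be confirmed exactly on a small discrete distribution. The sketch below is an illustration with made-up numbers: a 2 × 1 random vector y takes three values with stated probabilities, and E(y'Ty) computed directly from the distribution is compared with tr(TΣ) + μ'Tμ. The matrix T is deliberately not symmetric, since the rule does not require symmetry.

```python
from fractions import Fraction as F

# Hypothetical pmf: (value of y, probability).
support = [((0, 0), F(1, 4)), ((1, 0), F(1, 4)), ((1, 2), F(1, 2))]
T = [[1, 2],
     [0, 3]]
n = 2

# Expectation vector and variance matrix computed from the pmf.
mu = [sum(p * v[i] for v, p in support) for i in range(n)]
Sigma = [[sum(p * (v[h] - mu[h]) * (v[i] - mu[i]) for v, p in support)
          for i in range(n)] for h in range(n)]

def quad(v):
    """Quadratic form v'Tv."""
    return sum(v[h] * T[h][i] * v[i] for h in range(n) for i in range(n))

direct = sum(p * quad(v) for v, p in support)                 # E(y'Ty) from the pmf
trace_T_Sigma = sum(T[h][i] * Sigma[i][h] for h in range(n) for i in range(n))
by_rule = trace_T_Sigma + quad(mu)                            # tr(T Sigma) + mu'T mu

print("direct  =", direct)
print("by rule =", by_rule)
```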


15.2. Classical Regression Model 


We now set out the statistical model that is most commonly used to 
justify running a sample LS regression to estimate population parame- 
ters. That is, we provide a context for the data, one in which we observe 
a drawing on an n × 1 random vector y and an n × k matrix X = 
(x_1, ..., x_k). The classical regression, or CR, model consists of these four 
assumptions: 

(15.1)  E(y) = Xβ, 
(15.2)  V(y) = σ²I, 
(15.3)  X nonstochastic, 
(15.4)  rank(X) = k. 

The understanding is that we observe X and y, while β and σ² are 
unknown. 

We interpret the assumptions briefly. In general, an n × 1 random 
vector y = (y_1, ..., y_n)' will have expectation vector μ and variance 
matrix Σ, with 

        μ_1           σ_11 ... σ_1n 
    μ =  .   ,   Σ =    .         .    . 
        μ_n           σ_n1 ... σ_nn 


So in general, the elements of a random vector y will have different 
expectations, different variances, and free covariances. But in the CR 
model, we have 

(15.1)  μ = Xβ, 

which says that μ_i = x_i'β, where x_i' is the ith row of X. (Caution: Do 
not confuse x_i' with the transpose of the ith column of X.) Consequently, 
all n of the unknown expectations, the μ_i's, are expressible in terms of 
k unknown parameters, the β_j's. The n expectations may well be dif- 
ferent, but they all lie in the same k-dimensional plane in n-space. 
Further, in the CR model, we have 

(15.2)  Σ = σ²I, 

which says that σ_ii = σ² for all i, and that σ_hi = 0 for all h ≠ i. Thus the 
random variables y_1, ..., y_n all have the same variance, and are uncor- 
related. Further, we have 


(15.3) X nonstochastic, 


which says that the elements of X are constants, that is, degenerate 
random variables. Their values are fixed in repeated samples, unlike 
the elements of y which, being random variables, will vary from sample 
to sample. Finally, we have 


(15.4)  rank(X) = k, 


which says that the n × k matrix X has full column rank; its k columns 
are linearly independent in the matrix algebra sense. 

In Chapter 16, we will return to the interpretation of the CR model, 
and to the population and sampling assumptions that underlie it. 


15.3. Estimation of β 


We proceed to the estimation of the unknown parameters β and σ². 
We have a sample (y, X) produced by the CR model. How shall we 
process the sample data to obtain parameter estimates? The proposal is 
to use the sample LP, that is, to run the LS linear regression of y on X. 
Because rank(X) = k, the normal equations of LS linear regression will 
have a unique solution, namely 

b = Ay,   where A = Q⁻¹X'. 

This k × 1 random vector b is our estimator of β. 

What properties does the estimator have? The matrix A is constant 
because it is a function of X alone. Hence b is a linear function of the 
random vector y, and R2 of Section 15.1 applies. Recalling that AX = 
I and that AA' = Q⁻¹, we calculate 

E(b) = AE(y) = Aμ = A(Xβ) = (AX)β = Iβ = β, 
V(b) = AV(y)A' = AΣA' = A(σ²I)A' = σ²AA' = σ²Q⁻¹. 

So the LS coefficient vector b is an unbiased estimator of the parameter 
vector β, with E(b_j) = β_j for j = 1, ..., k. And the variances and 
covariances of the k random variables in b are given by the appropriate 
elements of the k × k matrix σ²Q⁻¹: 

V(b_j) = σ²q^jj,   C(b_h, b_j) = σ²q^hj, 

where q^hj denotes the element in the hth row and jth column of Q⁻¹. 
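
These two results can be verified exactly in a toy CR model. The following sketch rests on assumptions that are not in the text: a hypothetical 4 × 2 matrix X, a hypothetical β = (1, 2)', and disturbances that are independently +1 or −1 with equal probability, so that V(y) = σ²I with σ² = 1. Enumerating all 16 equally likely samples gives the exact sampling distribution of b.

```python
from fractions import Fraction as F
from itertools import product

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(Q):
    (a, b), (c, d) = Q
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

X = [[F(1), F(x)] for x in (1, 2, 3, 4)]      # hypothetical nonstochastic X
beta = [[F(1)], [F(2)]]                       # hypothetical parameter vector
Q_inv = inverse_2x2(matmul(transpose(X), X))
A = matmul(Q_inv, transpose(X))
mu = matmul(X, beta)                          # E(y) = X beta

# Each disturbance is -1 or +1 with probability 1/2, independently: sigma^2 = 1.
outcomes = list(product((-1, 1), repeat=len(X)))
prob = F(1, len(outcomes))

draws = []
for eps in outcomes:
    y = [[mu[i][0] + eps[i]] for i in range(len(X))]
    draws.append(matmul(A, y))                # b = Ay in this sample

E_b = [[sum(prob * b[j][0] for b in draws)] for j in range(2)]
V_b = [[sum(prob * (b[h][0] - E_b[h][0]) * (b[j][0] - E_b[j][0]) for b in draws)
        for j in range(2)] for h in range(2)]

print("E(b) =", [str(v[0]) for v in E_b])
print("V(b) =", [[str(v) for v in row] for row in V_b])
```

Because σ² = 1 in this construction, the exact V(b) coincides with Q⁻¹ itself.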


15.4. Gauss-Markov Theorem 
We now show that the LS estimator has an optimality property. 


GAUSS-MARKOV THEOREM. In the CR model, the LS coeffi- 
cient vector b is the minimum variance linear unbiased estimator of the 
parameter vector β. 

Proof. Let b* = A*y, where A* is any k × n nonstochastic matrix. 
Then b* is a linear function of y, that is, a linear estimator. Rule R2 
gives 

E(b*) = A*E(y) = A*Xβ, 
V(b*) = A*V(y)A*' = σ²A*A*'. 


Clearly b* will be unbiased for β iff A*X = I. Write A* = A + D, where 
A = Q⁻¹X' and D = A* − A. Observe that 

A*X = AX + DX = I + DX, 
A*A*' = (A + D)(A + D)' = AA' + DD' + AD' + DA'. 

So the unbiasedness condition A*X = I is equivalent to DX = O, that 
is, to DXQ⁻¹ = O, that is, to DA' = O (and hence to AD' = O). So if 
b* is a linear unbiased estimator of β, then 

V(b*) = σ²(AA' + DD') = V(b) + σ²DD'. 

The matrix DD' is nonnegative definite and the scalar σ² is positive, so 
σ²DD' is nonnegative definite. Consequently, V(b*) ≥ V(b), with equality 
iff DD' = O, that is, iff D = O, that is, iff A* = A, that is, iff b* = b in 
every sample. ∎ 


Some explanations are in order: 

* The matrix DD' is nonnegative definite because for any k × 1 vector 
h, the quadratic form h'DD'h = (D'h)'(D'h) = v'v ≥ 0, where v = D'h. 

* If t* and t are random vectors, we say that V(t*) ≥ V(t) iff V(t*) − 
V(t) is nonnegative definite. 

Observe the implications of the nonnegative definiteness of V(b*) − 
V(b). Element by element, b is preferable to b*, any other linear 
unbiased estimator of β, because its elements have smaller variances. 
But also consider a linear combination of the β_j's, say θ = h'β, where 
h is a constant k × 1 vector. Let t = h'b and let t* = h'b*. Then both t 
and t* are unbiased for θ, but V(t*) − V(t) = h'[V(b*) − V(b)]h ≥ 0. So 
b is also preferable to b* for constructing estimators of linear combi- 
nations. 
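
The decomposition V(b*) = V(b) + σ²DD' can be illustrated with a specific competitor. In the sketch below, which uses a hypothetical X with regressor values 1, 2, 3, 4, the alternative estimator draws the line through the first and last observations only. It is linear and unbiased (A*X = I), and the sketch checks that DX = O and that DD' = A*A*' − AA' is nonnegative definite.

```python
from fractions import Fraction as F

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(Q):
    (a, b), (c, d) = Q
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

X = [[F(1), F(x)] for x in (1, 2, 3, 4)]      # hypothetical nonstochastic X
A = matmul(inverse_2x2(matmul(transpose(X), X)), transpose(X))   # LS: b = Ay

# "Endpoint" estimator: slope = (y_4 - y_1)/3, intercept = y_1 - slope * 1.
A_star = [[F(4, 3), F(0), F(0), F(-1, 3)],
          [F(-1, 3), F(0), F(0), F(1, 3)]]

D = [[A_star[j][i] - A[j][i] for i in range(4)] for j in range(2)]
DX = matmul(D, X)                             # O iff b* is unbiased
DD = matmul(D, transpose(D))                  # [V(b*) - V(b)] / sigma^2
gap = [[s - a for s, a in zip(rs, ra)]
       for rs, ra in zip(matmul(A_star, transpose(A_star)), matmul(A, transpose(A)))]
det_DD = DD[0][0] * DD[1][1] - DD[0][1] * DD[1][0]

print("A*X =", [[str(v) for v in row] for row in matmul(A_star, X)])
print("DD' =", [[str(v) for v in row] for row in DD])
```

A symmetric 2 × 2 matrix with nonnegative diagonal elements and nonnegative determinant is nonnegative definite, which is the check applied here.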


15.5. Estimation of σ² and V(b) 


For estimation of the parameter σ², we draw on the LS residual vector 
e = My. The matrix M is constant because it is a function of X alone. 
Hence e = My is a linear function of the random vector y, and R2 
applies. Recalling that MX = O and MM' = MM = M, we calculate 

E(e) = ME(y) = MXβ = (MX)β = Oβ = 0, 
V(e) = MV(y)M' = M(σ²I)M' = σ²MM' = σ²M. 


Thus, considered as random variables, the residuals e_1, ..., e_n have 
zero expectations, generally different variances, and nonzero covari- 
ances. Now calculate the expectation of the random variable e'e, the 
sum of squared residuals. Apply R4, with e playing the role of y: 

E(e'e) = tr[V(e)] + [E(e)]'[E(e)] = tr(σ²M) + 0'0 = σ² tr(M). 

But N = XA and AX = I, so 

tr(N) = tr(XA) = tr(AX) = tr(I_k) = k. 

Hence for M = I − N, we have 

tr(M) = tr(I_n − N) = tr(I_n) − tr(N) = n − k. 

So 

E(e'e) = σ²(n − k). 

Defining the adjusted mean squared residual, 

σ̂² = e'e/(n − k), 

we have E(σ̂²) = E(e'e)/(n − k) = σ². So σ̂² is an unbiased estimator 
of σ². 

Finally, we estimate the variance matrix V(b) = σ²Q⁻¹ by 

V̂(b) = σ̂²Q⁻¹. 

Because E(σ̂²) = σ² and Q⁻¹ is constant, it follows that 

E[V̂(b)] = σ²Q⁻¹ = V(b), 

so that V̂(b) is an unbiased estimator of V(b). In particular, 

σ̂²_{b_j} = σ̂²q^jj 

is an unbiased estimator of V(b_j) = σ²_{b_j} = σ²q^jj. The square root of the 
estimated variance, 

σ̂_{b_j} = σ̂√(q^jj), 

serves as the standard error of b_j. 
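
The steps of this section can be followed on a small data set. The sketch below uses hypothetical data (four observations, a constant and one regressor) and computes the residuals, the adjusted mean squared residual σ̂² = e'e/(n − k), the estimated variance matrix σ̂²Q⁻¹, and the standard errors. Exact fractions are kept until the final square root.

```python
import math
from fractions import Fraction as F

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(Q):
    (a, b), (c, d) = Q
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical sample: n = 4, k = 2.
X = [[F(1), F(x)] for x in (1, 2, 3, 4)]
y = [[F(v)] for v in (2, 3, 5, 4)]
n, k = len(X), len(X[0])

Q_inv = inverse_2x2(matmul(transpose(X), X))
b = matmul(matmul(Q_inv, transpose(X)), y)        # LS coefficients
fitted = matmul(X, b)
e = [y[i][0] - fitted[i][0] for i in range(n)]    # residuals

sigma2_hat = sum(r * r for r in e) / (n - k)      # e'e / (n - k)
V_hat = [[sigma2_hat * q for q in row] for row in Q_inv]   # estimate of V(b)
se = [math.sqrt(float(V_hat[j][j])) for j in range(k)]     # standard errors

for j in range(k):
    print(f"b_{j + 1} = {float(b[j][0]):.4f}  (s.e. {se[j]:.4f})")
print("sigma2_hat =", sigma2_hat)
```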
Exercises 


15.1 Suppose that the random vector x has E(x) = μ, V(x) = Σ, and 
that y = g + Hx, where 


1 2.12 
12 1 
p=] 2], Z=(1 3 14, 5- (2). н={(_| 9 1: 
3 2 14 
Calculate E(y), V(y), E(yy'), E(y'y), C(y, x), and C(x, y). 


15.2 Suppose the CR model applies with n = 40, σ² = 4, and 


wy [40 10 _{3 
xx = (10 5), 8=(3)- 
Let b be the LS coefficient vector and t = b'b. Find E(t). 


15.3 The CR model applies with σ² = 2, and 


ee). s) 


A sample is drawn and the LS coefficients b_1 and b_2 are calculated. 


(a) Guess, as best you can, the value of b_2. Explain. 
(b) Now you are told that b_1 = 4. Guess, as best you can, the value 
of b_2. Explain. 


15.4 The CR model applies along with the usual notation. For each 
of the following statements, indicate whether it is true or false, and 
justify your answer. 


(a) The random variable t = b'b is an unbiased estimator of the 
parameter θ = β'β. 

(b) Since ŷ = Ny, it follows that y = N⁻¹ŷ. 

(c) Since E(ŷ) = E(y), it follows that the sum of the residuals is zero. 

(d) If b_1 and b_2 are the first two elements of b, t_1 = b_1 + b_2 and t_2 = 
b_1 − b_2, then V(t_1) = V(t_2). 

15.5 Show that the LS coefficients b = Ay are uncorrelated with the 
residuals e = My. Hint: See R6, Section 15.1. 


15.6 Suppose that the CR model applies to the data of Exercise 14.1. 
Report your estimates of the β_j parameters, with standard errors in 
parentheses beneath the coefficient estimates. Also report σ̂². 


16 Classical Regression: 
Interpretation and Application 


16.1. Interpretation of the Classical Regression Model 


It is instructive to compare our specification of the classical regression 
model to the more customary one. Our CR model is specified as 


(16.1)  E(y) = Xβ, 
(16.2)  V(y) = σ²I, 
(16.3)  X nonstochastic, 
(16.4)  rank(X) = k. 


Judge et al. (1988, pp. 178-183) specify a "General Linear Statistical 
Model" as follows (notation has been slightly changed): 

(16.1*)  y = Xβ + ε, 

(16.2*)  X is a known nonstochastic matrix with linearly independent 
columns, 

(16.3*)  E(ε) = 0, 
(16.4*)  E(εε') = σ²I. 


The two models are equivalent. Judge et al.'s ε is simply the disturbance 
vector, the deviation of the random vector y from its expectation μ = 
Xβ. In that style, for a scalar random variable y with E(y) = μ and 
V(y) = σ², one might write y = μ + ε, E(ε) = 0, E(ε²) = σ². There is no 
serious objection to doing so, except that it tends to give the disturbance 
a life of its own, rather than treating it as merely the deviation of a 
random variable from its expected value. Doing so may make one think 
of μ as the "true value" of y and of ε as an "error" or "mistake." For 
example, Judge et al. (1988, p. 179) say that the disturbance ε "is a 
random vector representing the unpredictable or uncontrollable errors 
associated with the outcome of the experiment," and Johnston (1984, 
p. 169) says that "if the theorist has done a good job in specifying all 
the significant explanatory variables to be included in X, it is reasonable 
to assume that both positive and negative discrepancies from the 
expected value will occur and that, on balance, they will average out at 
zero." Such language may overdramatize the primitive concept of the 
difference between the observed and the expected values of a random 
variable. In any event, we will want to distinguish between the disturbance 
vector ε = y − μ, which is unobserved, and the residual vector e = 
y − ŷ, which is observed. 

In what situation would the CR model be justified? Suppose that 
there is a multivariate population for the random vector (y, x_2, ..., 
x_k)', with pdf or pmf f(y, x_2, ..., x_k). Expectations, variances, and 
covariances are defined in the usual manner: 

E(y) = μ_y,   V(y) = σ_y²,   C(x_h, x_j) = σ_hj,   C(x_j, y) = σ_jy, 

and so forth. Suppose further that the conditional expectation function 
of y given the x's is linear, 

E(y|x_2, ..., x_k) = β_1 + β_2 x_2 + ... + β_k x_k, 

and that the conditional variance function of y given the x's is constant, 

V(y|x_2, ..., x_k) = σ², 

say. We write these compactly as 

(16.5)  E(y|x) = x'β,   V(y|x) = σ², 

where x = (x_1, ..., x_k)' with x_1 = 1, and β = (β_1, ..., β_k)'. 

As for sampling schemes, the most natural one to consider would be: 

Random Sampling from the Multivariate Population. Here n independent 
drawings, (y_1, x_1'), ..., (y_n, x_n'), are made, giving the observed sample 
data (y, X). In this scheme, the rows of the observed data matrix, namely 
the (y_i, x_i'), are independent and identically distributed across i. So from 
Eq. (16.5), it follows that E(y_i|x_i) = x_i'β and V(y_i|x_i) = σ². But also 
E(y_i) = μ_y for all i, V(y_i) = σ_y² for all i, and the X matrix is random. So 
this sampling scheme does not support the CR model, in which the 
expectations of the y_i differ and the X matrix is not random. 

Instead of random sampling, we will rely on: 

Stratified Sampling from the Multivariate Population. Here n values of the 
random vector x are specified. These values, the x_i (i = 1, ..., n), 
define n subpopulations, or strata. In the ith subpopulation, or stratum, 
the pdf or pmf of the dependent variable is g(y|x_i), with E(y|x_i) = 
x_i'β = μ_i, say, and V(y|x_i) = σ². A random drawing is made from each 
subpopulation. That is, y_1 is drawn from subpopulation 1, y_2 is drawn 
from subpopulation 2, and so on. The successive drawings are indepen- 
dent. In this scheme, the sampled y's are not identically distributed; 
they are drawn from different subpopulations. The list of n selected x_i 
vectors is maintained in repeated sampling, so the expectations of the 
successive y_i's will depend only on i. We can then write E(y_i) instead of 
E(y_i|x_i), and similarly we can write V(y_i) instead of V(y_i|x_i). There 
is no need for all the x_i's to differ: the relevant requirement is that 
rank(X) = k, so we need k linearly independent (in the matrix-algebra 
sense) x_i's. As discussed in Section 13.5, stratified sampling does not 
require that the researcher control the x values in the sense of imposing 
them on the subjects. 

Under stratified sampling, it does not make sense to use the sample 
to estimate the population means and variances of the x's and y. The 
sample on x is not randomly drawn from the population joint distri- 
bution of x, and consequently the sample on y is not randomly drawn 
from the population marginal distribution of y. Still, as in the bivariate 
case (Section 13.5), while stratification on x does induce a new marginal 
distribution for x and y, it preserves the conditional probability distri- 
butions of y given x. That suffices when we are concerned with the 
conditional expectation of y given x. 

This stratified sampling scheme, also known as the nonstochastic 
explanatory variable scheme, will support the CR model. We adopt it 
now in order to simplify the theory. In Chapter 25 we will see how the 
conclusions carry over to the more natural scheme of random sampling. 

Setting aside the sampling aspects, it is useful to compare this discus- 
sion of the underlying assumptions of the CR model with that in other 
textbooks. Johnston (1984, p. 169) seems to say that for the CR model 
to be correct, the theorist must have “done a good job in specifying all 
the significant explanatory variables.” Judge et al. (1988, p. 186) say 
that “it is assumed that the X matrix contains the correct set of explan- 
atory variables. In real-world situations we seldom, if ever, know the 
correct set of explanatory variables, and, consequently, certain relevant 
variables may be excluded or certain extraneous variables may be 
included." Here "the correct set of explanatory variables" seems to mean 
the variables that "could have or actually determined the outcomes that 
we have observed" (ibid., p. 178). 

Such requirements are very stringent, and have a causal flavor that 
is not part of the explicit specification of the CR model. An alternative 
position is less stringent and is free of causal language. Nothing in the 
CR model itself requires an exhaustive list of the explanatory variables, 
nor any assumption about the direction of causality. We have in mind 
a joint probability distribution, in which any conditional expectation 
function is conceivably of interest. For example, suppose that the 
random vector (y, x_2, x_3) has a trivariate probability distribution. On the 
one hand, we might be interested in E(y|x_2, x_3), but on the other hand 
we might be interested in E(y|x_2) or, for that matter, in E(x_2|x_3, y). It is 
possible that all of those CEF's are linear, and that none of them is 
causal. It may be true that causal relations are the most interesting ones, 
but that is a matter of economics rather than of statistics. More on this 
in Chapter 31. 


16.2. Estimation of Linear Functions of $\beta$ 


In the CR model we deal with an $n \times 1$ random vector $y$. In general 
such a vector would have $E(y) = \mu$ and $V(y) = \Sigma$. One thing that makes 
the CR model special is the assumption that $\mu = X\beta$, that is, 
$\mu_i = x_i'\beta$. The $n$ unknown $\mu_i$'s may well be distinct, but all of 
them are expressible in terms of only $k$ unknown $\beta_j$'s. 

In the CR model, we estimate $\beta$ by $b$, and thus estimate $\mu = X\beta$ by 
$\hat{\mu} = Xb = Ny = \hat{y}$, rather than by $y$ itself. Now $E(\hat{y}) = \mu$, and also 
$E(y) = \mu$, so both the fitted-value vector and the observed vector are 
unbiased estimators of $\mu$. Why is it preferable to use $\hat{y}$? An answer runs 
as follows. Because $V(y) = \sigma^2 I$ and $V(\hat{y}) = \sigma^2 N$, we have 

$$V(y) - V(\hat{y}) = \sigma^2 (I - N) = \sigma^2 M.$$

The matrix $M = M'M$ is nonnegative definite, so $V(y) \ge V(\hat{y})$. 

With respect to a single element of $\mu$, say $\mu_i$, the preferred estimator 
is $\hat{\mu}_i = x_i'b = \hat{y}_i = n_i'y$, where $n_i'$ denotes the $i$th row of $N$. A 
simpler unbiased estimator is $y_i = h_i'y$, where $h_i$ is the $n \times 1$ vector 
with a 1 in its $i$th slot and zeroes elsewhere. Observe that $\hat{\mu}_i$ is a 
linear function of all $n$ of the $y$'s, while $y_i$ is a function of only one 
of them. Evidently in the CR model it is desirable to combine information 
from all the observations in order to estimate the expectation of a 
single one. Such a preference is clear in random sampling from a 
univariate population, where all the observations have the same 
expectation. In the CR model, the preference persists even though the 
expectations are not the same. The reason is that the expectations are 
linked together, being functions of the same $k$ $\beta_j$'s. 

Now, $\mu_i = x_i'\beta$ is a special case of a linear combination of the 
$\beta$'s. The general case is $\theta = h'\beta$, where $h$ is a nonrandom $k \times 1$ 
vector. Other special cases are of interest. For example, take 
$h = (0, 0, 1, 0, \ldots, 0)'$, then $\theta = \beta_3$; or take 
$h = (0, 1, -1, 0, \ldots, 0)'$, then $\theta = \beta_2 - \beta_3$. As indicated in 
Section 15.4, the preferred estimator of such a $\theta$ in the CR model is 
$t = h'b$. We elaborate on that point here. 

By linear function rules, $E(t) = h'E(b) = h'\beta = \theta$, so that $t$ is an 
unbiased estimator of $\theta$. Further, $V(t) = h'V(b)h = \sigma^2 h'Q^{-1}h$. We can 
express $t$ as a linear function of $y$: $t = h'b = h'Ay = w'y$, where 
$w = A'h$ is $n \times 1$ and nonstochastic. Consider all linear functions of $y$ 
that might be used to estimate $\theta$: $t^* = w^{*\prime}y$, where $w^*$ is a 
nonstochastic $n \times 1$ vector. We have 

$$E(t^*) = w^{*\prime}\mu = w^{*\prime}X\beta, \qquad V(t^*) = w^{*\prime}\Sigma w^* = \sigma^2 w^{*\prime}w^*,$$

so $t^*$ is unbiased iff $w^{*\prime}X = h'$. In that event, we can write 

$$h'Q^{-1}h = w^{*\prime}XQ^{-1}X'w^* = w^{*\prime}Nw^*,$$

and thus write 

$$V(t) = \sigma^2 h'Q^{-1}h = \sigma^2 w^{*\prime}Nw^*.$$

Observe that 

$$V(t^*) - V(t) = \sigma^2 w^{*\prime}(I - N)w^* = \sigma^2 w^{*\prime}Mw^* \ge 0,$$

because $M$ is nonnegative definite. Thus the natural estimator of 
$\theta = h'\beta$, namely $t = h'b$, is in fact MVLUE in the CR model, where 
“linear” means linear in $y$. 

In practice, we will want to give some indication of the reliability of 
our estimate of $\theta$. To estimate $V(t)$, replace $\sigma^2$ by $\hat{\sigma}^2$. The 
resulting standard error for $t = h'b$ is $\hat{\sigma}_t = \hat{\sigma}\sqrt{h'Q^{-1}h}$. 

To recapitulate: in the CR model, whether our interest is in estimating 
the full vector $\beta$, in estimating one of its elements $\beta_j$, or in 
estimating a linear combination $\theta$ of its elements, the preferred 
procedure is to use LS linear regression. 
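
The argument of this section can be checked numerically. The following 
Python sketch is only an illustration: the data are hypothetical (they do 
not come from the text), the helper functions and variable names are 
invented for the purpose, and exact rational arithmetic is used so that 
the identities hold without rounding. It computes $t = h'b$ for 
$h = (0, 1, -1)'$ along with its standard error, and then builds one 
alternative unbiased linear estimator, using the fact that a weight 
vector of the form $w^* = w + Mz$ satisfies $w^{*\prime}X = h'$. 

```python
from fractions import Fraction
from math import sqrt

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Exact Gauss-Jordan inverse; assumes A is square and nonsingular."""
    n = len(A)
    W = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if W[r][c] != 0)   # first nonzero pivot
        W[c], W[p] = W[p], W[c]
        W[c] = [v / W[c][c] for v in W[c]]
        for r in range(n):
            if r != c:
                W[r] = [v - W[r][c] * w for v, w in zip(W[r], W[c])]
    return [row[n:] for row in W]

# Hypothetical data: n = 6 observations, k = 3 columns (intercept, x2, x3).
X = [[1, 1, 0], [1, 2, 1], [1, 3, 0], [1, 4, 1], [1, 5, 0], [1, 6, 1]]
y = [[3], [5], [4], [8], [7], [11]]
n, k = len(X), len(X[0])

Qinv = inverse(matmul(transpose(X), X))      # Q^{-1} = (X'X)^{-1}
A = matmul(Qinv, transpose(X))               # A = Q^{-1} X'
b = matmul(A, y)                             # LS coefficients b = Ay
e = [[yi[0] - fi[0]] for yi, fi in zip(y, matmul(X, b))]
s2 = matmul(transpose(e), e)[0][0] / (n - k) # sigma-hat squared = e'e / (n - k)

h = [[0], [1], [-1]]                         # theta = beta_2 - beta_3
est = matmul(transpose(h), b)[0][0]          # t = h'b
hQh = matmul(matmul(transpose(h), Qinv), h)[0][0]
se = sqrt(float(s2 * hQh))                   # sigma-hat * sqrt(h' Q^{-1} h)

w = matmul(transpose(A), h)                  # t = w'y with w = A'h

# One alternative unbiased linear estimator: w* = w + Mz (z chosen arbitrarily).
N = matmul(X, A)
M = [[int(i == j) - N[i][j] for j in range(n)] for i in range(n)]
z = [[1], [0], [0], [0], [0], [0]]
w_star = [[wi[0] + mi[0]] for wi, mi in zip(w, matmul(M, z))]

# [V(t*) - V(t)] / sigma^2 = w*'w* - w'w, which equals z'Mz here.
excess = matmul(transpose(w_star), w_star)[0][0] - matmul(transpose(w), w)[0][0]

print("t =", est, " standard error =", se, " excess variance / sigma^2 =", excess)
```

The alternative estimator remains unbiased, but its variance exceeds 
that of $t$ by $\sigma^2 z'Mz$, which is nonnegative for every choice of $z$. 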


16.3. Estimation of Conditional Expectation, and Prediction 


In the CR model, to estimate the parameter $\mu_i = E(y_i) = x_i'\beta$, our 
conclusion was to use $\hat{\mu}_i = x_i'b$. The expectation and variance of this 
estimator are 

$$E(\hat{\mu}_i) = \mu_i, \qquad V(\hat{\mu}_i) = \sigma^2 x_i'Q^{-1}x_i = \sigma^2 n_{ii},$$

where $n_{ii}$ is the $i$th diagonal element of $N$. Now suppose that we are 
interested in estimating a point on the CEF, say $\mu_0 = x_0'\beta$, where $x_0$ is 
some $k \times 1$ vector, not necessarily one of the points at which we have 
sampled. This parameter $\mu_0$ is the expectation of $y_0$, where $y_0$ is a 
random drawing from the subpopulation defined by $x = x_0$. Because 
$\mu_0$ is a linear combination of the elements of $\beta$, the preferred estimator 
for it is $\hat{\mu}_0 = x_0'b$, which has expectation and variance 

$$E(\hat{\mu}_0) = \mu_0, \qquad V(\hat{\mu}_0) = \sigma^2 x_0'Q^{-1}x_0.$$

The standard error for this estimator is $\hat{\sigma}\sqrt{x_0'Q^{-1}x_0}$. 

Prediction, or forecasting, is a distinct problem. There the objective 
is to predict the value of $y_0$, a single random drawing from the 
subpopulation defined by $x = x_0$. If we knew $\beta$, our prediction would be 
$\mu_0 = x_0'\beta$. The prediction error would be $\epsilon_0 = y_0 - \mu_0$, with 
expectation $E(\epsilon_0) = 0$ and variance $E(\epsilon_0^2) = V(y_0) = \sigma^2$. In 
practice, we do not know $\beta$, but we have a sample from the CR model, 
from which we have calculated $b$. The natural predictor will be 
$\hat{\mu}_0 = x_0'b$. When that predictor is used, the prediction error will be 
$u = y_0 - \hat{\mu}_0$, with 

$$E(u) = E(y_0) - E(\hat{\mu}_0) = 0,$$

$$V(u) = V(y_0) + V(\hat{\mu}_0) - 2C(y_0, \hat{\mu}_0) = \sigma^2 + \sigma^2 x_0'Q^{-1}x_0 = \sigma^2(1 + x_0'Q^{-1}x_0),$$

taking the covariance to be zero on the understanding that the drawing 
on $y_0$ is independent of the sample observations $y$. This predictor is 
unbiased, and the variance of the prediction error has two additive 
components: the variance of the prediction error that would be made 
were $\mu_0$ known and used, and the variance of the estimator of $\mu_0$. The 
“standard error of forecast,” which is the square root of the estimate of 
$V(u)$, is given by $\hat{\sigma}\sqrt{1 + x_0'Q^{-1}x_0}$. 
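
To see the two standard errors side by side, here is a minimal Python 
sketch. The five observations and the point $x_0 = (1, 7)'$ are 
hypothetical, and the helper functions are illustrative rather than part 
of the text. The sketch fits a simple regression, estimates the point on 
the CEF, and computes both the standard error of that estimate and the 
standard error of forecast. 

```python
from fractions import Fraction
from math import sqrt

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Exact Gauss-Jordan inverse; assumes A is square and nonsingular."""
    n = len(A)
    W = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if W[r][c] != 0)   # first nonzero pivot
        W[c], W[p] = W[p], W[c]
        W[c] = [v / W[c][c] for v in W[c]]
        for r in range(n):
            if r != c:
                W[r] = [v - W[r][c] * w for v, w in zip(W[r], W[c])]
    return [row[n:] for row in W]

# Hypothetical sample: intercept plus one explanatory variable.
X = [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]
y = [[2], [3], [5], [4], [6]]
n, k = len(X), len(X[0])

Qinv = inverse(matmul(transpose(X), X))
b = matmul(Qinv, matmul(transpose(X), y))               # LS coefficients
e = [yi[0] - fi[0] for yi, fi in zip(y, matmul(X, b))]  # residuals
s2 = sum(r * r for r in e) / (n - k)                    # sigma-hat squared

x0 = [[1], [7]]                                         # point outside the sample
mu0_hat = matmul(transpose(x0), b)[0][0]                # x0'b
q0 = matmul(matmul(transpose(x0), Qinv), x0)[0][0]      # x0' Q^{-1} x0

se_cef = sqrt(float(s2 * q0))             # standard error for the CEF estimate
se_forecast = sqrt(float(s2 * (1 + q0)))  # standard error of forecast

print("estimate:", float(mu0_hat))
print("s.e. of CEF estimate:", se_cef)
print("s.e. of forecast:", se_forecast)
```

The point estimate is the same for both problems; only the measure of 
reliability differs, and the forecast standard error is always the 
larger of the two because it adds $\hat{\sigma}^2$ under the square root. 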


16.4. Measuring Goodness of Fit 


In empirical research that relies on the CR model, the objective is to 
estimate the population parameter vector $\beta$, rather than to “fit the data” 
or to “explain the variation in the dependent variable.” Nevertheless it 
is customary to report, along with the parameter estimates and their 
standard errors, a measure of goodness of fit. 

To develop the measure, return to the algebra of least squares. Given 
the data $y$, $X = (x_1, \ldots, x_k)$, we have run the LS linear regression of 
$y$ on $X$, obtaining the coefficient vector $b$, the fitted-value vector 
$\hat{y}$, and the residual vector $e$. Observe that $y = \hat{y} + e$, and that 
$\hat{y}'e = (Xb)'e = b'(X'e) = b'0 = 0$, by the FOC's. So 

$$y'y = (\hat{y} + e)'(\hat{y} + e) = \hat{y}'\hat{y} + e'e,$$

which algebraically is 

$$\sum_i y_i^2 = \sum_i \hat{y}_i^2 + \sum_i e_i^2. \tag{16.6}$$

This is an analysis (that is, decomposition) of sum of squares: the sum of 
squares of observed values is equal to the sum of squares of the fitted 
values plus the sum of squares of the residuals. 

Further, $\sum_i y_i = \sum_i \hat{y}_i + \sum_i e_i$, so the mean of the observed 
values equals the mean of the fitted values plus the mean of the 
residuals: 

$$\bar{y} = \bar{\hat{y}} + \bar{e}.$$

Now if $\bar{e} = 0$, then $\bar{\hat{y}} = \bar{y}$, so $n\bar{\hat{y}}^2 = n\bar{y}^2$, which subtracted 
from the decomposition in Eq. (16.6) gives 

$$\sum_i (y_i - \bar{y})^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i e_i^2. \tag{16.7}$$

This is an analysis of variation, where variation is defined to be the sum 
of squared deviations about the sample mean. Provided that the mean 
residual is zero, the variation of the observed values is equal to the 
variation of the fitted values plus the variation of the residuals. 

Divide Eq. (16.7) through by $\sum_i (y_i - \bar{y})^2$ to get 

$$R^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - \bar{y})^2}. \tag{16.8}$$

The measure $R^2$, which will lie between zero and unity, is called the 
coefficient of determination, or squared multiple correlation coefficient. It 
measures, one says, the proportion of the variation of $y$ that is accounted 
for (linearly) by variation in the $x$'s; note that the fitted value 
$\hat{y}_i$ is an exact linear function of the $x_{ij}$'s. In this sense, $R^2$ 
measures the goodness of fit of the regression. 

Consider an extreme case: 

$$R^2 = 1 \iff \sum_i e_i^2 = 0 \iff e'e = 0 \iff e = 0 \iff y = Xb,$$

in which case the observed $y$'s fall on an exact linear function of the 
$x$'s. The fit is perfect; all of the variation in $y$ is accounted for by 
the variation in the $x$'s. At the other extreme: 

$$R^2 = 0 \iff \sum_i (\hat{y}_i - \bar{y})^2 = 0 \iff \hat{y}_i = \bar{y} \text{ for all } i,$$

in which case the best-fitting line is horizontal, and none of the variation 
in $y$ is accounted for by variation in the $x$'s. 
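
The decomposition behind Eq. (16.8) is easy to confirm on a small 
example. The Python sketch below uses hypothetical data and illustrative 
helper functions (neither is from the text), fits a regression that 
includes an intercept, and computes $R^2$ both as the explained 
proportion and as one minus the unexplained proportion. 

```python
from fractions import Fraction

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Exact Gauss-Jordan inverse; assumes A is square and nonsingular."""
    n = len(A)
    W = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if W[r][c] != 0)   # first nonzero pivot
        W[c], W[p] = W[p], W[c]
        W[c] = [v / W[c][c] for v in W[c]]
        for r in range(n):
            if r != c:
                W[r] = [v - W[r][c] * w for v, w in zip(W[r], W[c])]
    return [row[n:] for row in W]

# Hypothetical sample: the first column of X is the summer vector.
X = [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]
y = [[2], [3], [5], [4], [6]]
n = len(y)

b = matmul(inverse(matmul(transpose(X), X)), matmul(transpose(X), y))
fit = [row[0] for row in matmul(X, b)]            # fitted values
res = [yi[0] - fi for yi, fi in zip(y, fit)]      # residuals

ybar = Fraction(sum(yi[0] for yi in y), n)
sst = sum((yi[0] - ybar) ** 2 for yi in y)        # variation of y
ssr = sum((f - ybar) ** 2 for f in fit)           # variation of fitted values
sse = sum(r * r for r in res)                     # sum of squared residuals

r2_explained = ssr / sst                          # first version in Eq. (16.8)
r2_residual = 1 - sse / sst                       # second version in Eq. (16.8)

print("mean residual:", sum(res) / n)
print("R^2 (explained):", float(r2_explained))
print("R^2 (1 - unexplained):", float(r2_residual))
```

Because the regression includes an intercept, the mean residual is zero 
and the two versions coincide. 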

From our perspective, $R^2$ has a very modest role in regression analysis, 
being a measure of the goodness of fit of a sample LS linear regression 
in a body of data. Nothing in the CR model requires that $R^2$ be high. 
Hence a high $R^2$ is not evidence in favor of the model, and a low $R^2$ is 
not evidence against it. Nevertheless, in empirical research reports, one 
often reads statements to the effect that “I have a high $R^2$, so my theory 
is good,” or “My $R^2$ is higher than yours, so my theory is better than 
yours.” 

In fact the most important thing about $R^2$ is that it is not important 
in the CR model. The CR model is concerned with parameters in a 
population, not with goodness of fit in the sample. In Section 6.6 we 
did introduce the population coefficient of determination $\rho^2$, as a 
measure of strength of a relation in the population. But that measure will 
not be invariant when we sample selectively, as in the CR model, because 
it depends upon the marginal distribution of the explanatory variables. 
If one insists on a measure of predictive success (or rather failure), then 
$\hat{\sigma}^2$ might suffice: after all, the parameter $\sigma^2$ is the expected squared 
forecast error that would result if the population CEF were used as the 
predictor. Alternatively, the squared standard error of forecast (Section 
16.3) at relevant values of $x$ may be informative. 

Some further remarks on the coefficient of determination follow. 

* One should not calculate $R^2$ when $\bar{e} \ne 0$, for then the equivalence 
of the two versions of $R^2$ in Eq. (16.8) breaks down, and neither of them 
is bounded between 0 and 1. What guarantees that $\bar{e} = 0$? The only 
guarantee can come from the FOC's $X'e = 0$. It is customary to allow 
for an intercept in the regression, that is, to have, as one of the columns 
of $X$, the $n \times 1$ vector $s = (1, 1, \ldots, 1)'$. We refer to this $s$ as the 
summer vector, because multiplying $s'$ into any vector will sum up the 
elements in the latter. If $s$ is one of the columns in $X$, then $s'e = 0$ is 
one of the FOC's, so $\bar{e} = 0$. The same conclusion follows if there is a 
linear combination of the columns of $X$ that equals the summer vector. 
Also if $y$ and $x_1, \ldots, x_k$ all have zero column means in the sample, 
then $\bar{e} = 0$. But otherwise a zero mean residual is sheer coincidence. 

* We can always find an $X$ that makes $R^2 = 1$: take any $n$ linearly 
independent $n \times 1$ vectors to form the $X$ matrix. Because such a set of 
vectors forms a basis for $n$-space, any $n \times 1$ vector $y$ will be expressible 
as an exact linear combination of the columns of that $X$. But of course 
“fitting the data” is not a proper objective of research using the CR 
model. 

* The fact that $R^2$ tends to increase as additional explanatory variables 
are included leads some researchers to report an adjusted (or 
“corrected”) coefficient of determination, which discounts the fit when $k$ 
is large relative to $n$. This measure, referred to as $\bar{R}^2$ (read as 
“R bar squared”), is defined via 

$$1 - \bar{R}^2 = (n - 1)(1 - R^2)/(n - k),$$

which inflates the unexplained proportion and hence deflates the 
explained proportion. There is no strong argument for using this 
particular adjustment: for example, $(1 - k/n)R^2$ would have a similar effect. 
It may well be preferable to report $R^2$, $n$, and $k$, and let readers decide 
how to allow for $n$ and $k$. 

* The adjusted coefficient of determination may be written explicitly 
as 

$$\bar{R}^2 = 1 - \left[\sum_i e_i^2/(n - k)\right] \Big/ \left[\sum_i (y_i - \bar{y})^2/(n - 1)\right]. \tag{16.9}$$

It is sometimes said that in the CR model, the numerator 
$\sum_i e_i^2/(n - k)$ is an unbiased estimator of the disturbance variance, and 
that the denominator $\sum_i (y_i - \bar{y})^2/(n - 1)$ is an unbiased estimator of 
the variance of $y$. The first claim is correct, as we know. But the second 
claim is not correct: in the CR model the variance of the disturbance is 
the same thing as the common variance of the $y_i$, namely $\sigma^2$. 
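
Two of the remarks above lend themselves to a numerical check. The 
Python sketch below is an illustration on hypothetical data, with 
helper functions that are not part of the text. It runs the same 
regression with and without an intercept and computes both versions of 
$R^2$ from Eq. (16.8) in each case; it then computes the adjusted 
coefficient of determination for the regression that has an intercept. 

```python
from fractions import Fraction

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Exact Gauss-Jordan inverse; assumes A is square and nonsingular."""
    n = len(A)
    W = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if W[r][c] != 0)   # first nonzero pivot
        W[c], W[p] = W[p], W[c]
        W[c] = [v / W[c][c] for v in W[c]]
        for r in range(n):
            if r != c:
                W[r] = [v - W[r][c] * w for v, w in zip(W[r], W[c])]
    return [row[n:] for row in W]

def ls_fit(X, y):
    """Fitted values and residuals (flat lists) from the LS regression of y on X."""
    Xt = transpose(X)
    b = matmul(inverse(matmul(Xt, X)), matmul(Xt, y))
    fit = [row[0] for row in matmul(X, b)]
    res = [yi[0] - fi for yi, fi in zip(y, fit)]
    return fit, res

x = [1, 2, 3, 4, 5]                        # hypothetical regressor values
y = [[2], [3], [5], [4], [6]]              # hypothetical observations
n = len(y)
ybar = Fraction(sum(v[0] for v in y), n)
sst = sum((v[0] - ybar) ** 2 for v in y)   # variation of y about its mean

designs = (("intercept", [[1, xi] for xi in x]),
           ("no intercept", [[xi] for xi in x]))
results = {}
for label, X in designs:
    fit, res = ls_fit(X, y)
    explained = sum((f - ybar) ** 2 for f in fit) / sst   # first version of R^2
    residual = 1 - sum(r * r for r in res) / sst          # second version of R^2
    results[label] = (explained, residual, sum(res))
    print(label, float(explained), float(residual), "sum of residuals:", sum(res))

# Adjusted coefficient of determination for the regression with an intercept.
k = 2
r2 = results["intercept"][1]
adj_r2 = 1 - Fraction(n - 1, n - k) * (1 - r2)
print("adjusted R^2:", float(adj_r2))
```

Without the intercept the residuals no longer sum to zero, the two 
versions disagree, and in this example the first of them exceeds one. 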


Exercises 


16.1 Continuing the numerical example of Exercises 14.1 and 15.6, 
assume that the CR model applies. Let $\theta = \beta_1 + \beta_2$. Report your estimate 
of $\theta$, along with its standard error. 


16.2 The CR model applies to $E(y) = X\beta$ with $\sigma^2 = 1$. Here $X$ is an 
$n \times 2$ matrix with 


mw (4 72 
xx-( 3 F 


You are offered the choice of two jobs: estimate $\beta_1 + \beta_2$, or estimate 
$\beta_1 - \beta_2$. You will be paid the dollar amount $10 - (t - \theta)^2$, where $t$ is 
your estimate and $\theta$ is the parameter combination that you have chosen 
to estimate. To maximize your expected pay, which job should you take? 
What pay will you expect to receive? 


16.3 In a regression analysis of the relation between earnings and 
various personal characteristics, a researcher includes these explanatory 
variables along with six others: 


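$$x_1 = \begin{cases} 1 & \text{if female} \\ 0 & \text{if male} \end{cases} \qquad x_2 = \begin{cases} 1 & \text{if male} \\ 0 & \text{if female} \end{cases}$$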


but does not include a constant term. 


(a) Does the sum of residuals from her LS regression equal zero? 
(b) Why did she not also include a constant term? 


16.4 Consider the customary situation, where the regression includes 
an intercept, and the first column of $X$ is $x_1 = s$, the summer vector. 
Let $M_1 = I - x_1(x_1'x_1)^{-1}x_1'$. 

(a) Show that $M_1 y$ is the vector of residuals from a regression of $y$ 
on the summer vector alone. 

(b) Show that $y'M_1 y = \sum_i (y_i - \bar{y})^2$. 

(c) Further suppose that the CR model applies to $E(y) = X\beta$. Apply 
R5 (Section 15.1) to show that 

$$E\Big[\sum_i (y_i - \bar{y})^2\Big] = (n - 1)\sigma^2 + \beta_2' X_2^{*\prime} X_2^* \beta_2,$$

where $\beta_2$ is the $(k - 1) \times 1$ subvector that remains when the first 
element of $\beta$ is deleted, $X_2$ is the $n \times (k - 1)$ submatrix that 
remains when the first column of $X$ is deleted, and $X_2^* = M_1 X_2$. 

(d) Evaluate the claim that in Eq. (16.9), the denominator of the 
adjusted coefficient of determination is an unbiased estimator of 
the variance of the dependent variable. 


16.5 GAUSS is a mathematical and statistical programming language, 
produced by Aptech Systems, Inc., Kent, Washington. We will rely on 
it frequently in the remainder of this book, presuming that it is installed 
on a computer available to you. Appendix B provides some introductory 
information about GAUSS; other information will be provided as hints 
in subsequent exercises. Version 1.49B of GAUSS is used here; modi- 
fication to other versions should be straightforward. 

Here is a GAUSS program to re-do Exercise 14.1. Enter it, run it, 
and print out the program file and the output file. 


/* ASG1605 */ output file = ASG1605.OUT reset; format 8,4; 
let x1 = 1 1 1 1 1; let x2 = 2 4 3 5 2; let y = 14 17 8 16 3; 
X = x1~x2; Q = X'X; dq = det(Q); QI = invpd(Q); 
A = QI*X'; N = X*A; I = eye(5); M = I - N; 
trn = sumc(diag(N)); trm = sumc(diag(M)); b = A*y; yh = N*y; e = M*y; 
"Q = " Q; ?; "det Q = " dq; ?; 
"Q inverse = " QI; ?; 
"N = " N; ?; "M = " M; ?; 
"tr(N) = " trn; "tr(M) = " trm; ?; 
"b = " b; ?; 
"yhat' = " yh'; ?; "e' = " e'; end; 


16.6 The algorithm used in Exercise 16.5 is not an efficient way to 
run LS linear regressions. Here is a more sensible way, which may serve 
as a starting point for your own future regression programs. The 
program also calculates the sum of squared residuals. Enter the program, 
personalizing it by completing names for the program and output files, 
and also entering your name. Run and print. 


/* ____1606 */ output file = ____1606.OUT reset; format 8,4; 
"Student name ____"; 
let x1 = 1 1 1 1 1; let x2 = 2 4 3 5 2; let y = 14 17 8 16 3; 
X = x1~x2; Q = X'X; QI = invpd(Q); b = QI*X'y; sse = y'y - b'X'y; 
"b = " b; ?; "sse = " sse; end; 


17.1. Regression Matrices 


In this chapter we explore a variety of algebraic and interpretive results 
that may be relevant for empirical applications of multiple regression. 
Associated with any $n \times k$ full-column-rank matrix $X$ are the matrices 

$$Q = X'X, \qquad A = Q^{-1}X', \qquad N = XA, \qquad M = I - N.$$

Multiplied into any $n \times 1$ vector, the matrices $A$, $N$, $M$ will produce 
respectively the coefficients, fitted values, and residuals from the LS 
linear regression of that vector upon $X$. With that in mind, we can 
interpret the results 

$$AX = I, \qquad NX = X, \qquad MX = O.$$

Suppose we regress the $j$th column of $X$, namely $x_j$, upon all the columns 
of $X$ (including the $j$th). Since LS linear regression chooses the linear 
combination of the columns of $X$ that comes closest to $x_j$, it is obvious 
that it will produce 1 as the coefficient on the $j$th explanatory variable, 
and 0's as the coefficients on all the other explanatory variables. It is 
equally obvious that the fit will be perfect: the fitted values will equal 
the observed values and the residuals will all be zero. That is, 

$$Ax_j = d_j, \qquad Nx_j = x_j, \qquad Mx_j = 0,$$

where $d_j$ is the $j$th column of the $k \times k$ identity matrix. Assembling those 
unusual regressions for $j = 1, \ldots, k$, we have indeed 

$$AX = I, \qquad NX = X, \qquad MX = O.$$


17.2. Short and Long Regression Algebra 


In empirical work, it is common to run a series of regressions, with 
successively shorter (or longer) lists of explanatory variables. We develop 
some algebra that is relevant to this practice. 

Given as data the $n \times 1$ vector $y$ and the $n \times k$ matrix 
$X = (x_1, \ldots, x_k)$, whose rank is $k$, we regress $y$ on $X$. That is, we 
choose $c$ to minimize $u'u$, where $u = y - Xc$, producing the coefficient 
and residual vectors 

$$b = Ay, \qquad e = My,$$

where 

$$A = (X'X)^{-1}X', \qquad M = I - XA,$$

and 

$$AX = I, \qquad MX = O, \qquad X'e = 0.$$

Partition $X$ as $X = (X_1, X_2)$, where $X_1$ is $n \times k_1$ and $X_2$ is $n \times k_2$. 
Correspondingly, partition $b$ as $(b_1', b_2')'$, where $b_1$ is $k_1 \times 1$ and 
$b_2$ is $k_2 \times 1$. As a result of the fit, we have 

$$y = Xb + e = (X_1, X_2)\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} + e = X_1 b_1 + X_2 b_2 + e. \tag{17.1}$$

Because $X'e = 0$, we know that $X_1'e = 0$ and $X_2'e = 0$. 

Suppose that we shorten the list of explanatory variables and regress 
$y$ on only the first $k_1$ columns of $X$. Regressing $y$ on $X_1$, that is, 
choosing the $k_1 \times 1$ vector $c_1$ to minimize $u^{*\prime}u^*$ where 
$u^* = y - X_1 c_1$, is a full-rank problem. The resulting coefficient vector 
and residual vector are 

$$b_1^* = A_1 y, \qquad e^* = M_1 y,$$

where 

$$A_1 = (X_1'X_1)^{-1}X_1', \qquad M_1 = I - X_1 A_1,$$

and of course 

$$A_1 X_1 = I, \qquad M_1 X_1 = O, \qquad X_1'e^* = 0.$$

As a result of this fit, we have 

$$y = X_1 b_1^* + e^*. \tag{17.2}$$

How are $b_1^*$ and $e^*$ related to $b_1$ and $e$? That is, how is the short 
regression (Eq. 17.2) related to the long regression (Eq. 17.1)? Here “short” 
and “long” refer to the length of the list of explanatory variables. 

To obtain the relations, consider the auxiliary regressions. Regress each 
column of $X_2$ in turn upon $X_1$ to obtain a set of auxiliary regressions: 

$$x_j = X_1 f_j + x_j^* \qquad (j = k_1 + 1, \ldots, k),$$

where $f_j = A_1 x_j$ and $x_j^* = M_1 x_j$. Assemble all these as 

$$X_2 = X_1 F + X_2^*. \tag{17.3}$$

Here 

$$F = (f_{k_1+1}, \ldots, f_k) = A_1 X_2$$

is a $k_1 \times k_2$ matrix, each column of which contains the coefficients from 
the regression of a column of $X_2$ upon (all the columns of) $X_1$, and 

$$X_2^* = (x_{k_1+1}^*, \ldots, x_k^*) = M_1 X_2$$

is an $n \times k_2$ matrix, each column of which contains the residuals from 
the regression of a column of $X_2$ upon (all the columns of) $X_1$. Because 
$X$ has full column rank, so does $X_2^*$: see Exercise 17.3. 

Use Eq. (17.1) to calculate 

$$b_1^* = A_1 y = A_1(X_1 b_1 + X_2 b_2 + e) = b_1 + F b_2, \tag{17.4}$$

because $A_1 e = (X_1'X_1)^{-1}X_1'e$ and $X_1'e = 0$. What Eq. (17.4) says is that 
the coefficients on $X_1$ in the short regression are a mixture of the 
coefficients on $X_1$ and on $X_2$ in the long regression, with the auxiliary 
regression coefficients serving as weights in that mixture. 

Use Eq. (17.1) again to calculate 

$$e^* = M_1 y = M_1(X_1 b_1 + X_2 b_2 + e) = X_2^* b_2 + e, \tag{17.5}$$

because $M_1 e = (I - X_1 A_1)e$ and $A_1 e = 0$. What Eq. (17.5) says is that 
the residuals from the short regression equal the residuals from the 
long regression plus a mixture of the elements of $X_2^*$, with the elements 
of $b_2$ serving as weights in the mixture. 

Use Eq. (17.5) along with the facts that $X_2^{*\prime} = X_2'M_1$, $M_1 e = e$, and 
$X_2'e = 0$ to calculate 

$$e^{*\prime}e^* = e'e + b_2' X_2^{*\prime} X_2^* b_2. \tag{17.6}$$

Equation (17.6) says that the sum of squared residuals from the short 
regression exceeds the sum of squared residuals from the long 
regression by the nonnegative quantity $v'v$, where $v = X_2^* b_2$. So shortening a 
regression cannot improve the fit. The two sums of squared residuals 
are equal iff $v = 0$, that is, iff $X_2^* b_2 = 0$, that is (because 
$\mathrm{rank}(X_2^*) = k_2$), iff $b_2 = 0$. A simpler argument leads to the same 
conclusion: running the short regression is equivalent to running the long 
regression subject to the constraint that the coefficients on $X_2$ be zero; 
a constrained minimum cannot be less than an unconstrained one, and the 
minima are equal iff the constraint is not binding. 

We have emphasized the contrast between the short and long 
regressions, but there are exceptional cases: 


(1) If $b_2 = 0$, then $b_1^* = b_1$, $e^* = e$, and $e^{*\prime}e^* = e'e$. 

(2) If $X_1'X_2 = O$ (each variable in $X_1$ is orthogonal, in the 
matrix-algebra sense, to every variable in $X_2$), then $F = A_1 X_2 = O$ and 
$X_2^* = X_2$, so $b_1^* = b_1$ although $e^* \ne e$. 
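
The relations (17.4), (17.5), and (17.6) hold in every data set, so they 
can be confirmed on any example. The Python sketch below does so for 
hypothetical data in which $X_1$ holds an intercept and one variable and 
$X_2$ holds a single further variable. The helper functions and names are 
illustrative assumptions, and exact rational arithmetic is used so that 
the equalities can be tested without rounding error. 

```python
from fractions import Fraction

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B, sign=1):
    """Elementwise A + sign*B for conformable matrices."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inverse(A):
    """Exact Gauss-Jordan inverse; assumes A is square and nonsingular."""
    n = len(A)
    W = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if W[r][c] != 0)   # first nonzero pivot
        W[c], W[p] = W[p], W[c]
        W[c] = [v / W[c][c] for v in W[c]]
        for r in range(n):
            if r != c:
                W[r] = [v - W[r][c] * w for v, w in zip(W[r], W[c])]
    return [row[n:] for row in W]

def ls(X, Y):
    """Coefficients and residuals from the LS regression of (each column of) Y on X."""
    Xt = transpose(X)
    coef = matmul(inverse(matmul(Xt, X)), matmul(Xt, Y))
    return coef, add(Y, matmul(X, coef), sign=-1)

def sum_sq(v):
    return matmul(transpose(v), v)[0][0]

# Hypothetical data: X1 = (intercept, x2), X2 = (x3).
X1 = [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6]]
X2 = [[0], [1], [0], [1], [0], [1]]
X = [r1 + r2 for r1, r2 in zip(X1, X2)]
y = [[3], [5], [4], [8], [7], [11]]

b, e = ls(X, y)                  # long regression
b1, b2 = b[:2], b[2:]
b1_star, e_star = ls(X1, y)      # short regression
F, X2_star = ls(X1, X2)          # auxiliary regression: F = A1 X2, X2* = M1 X2
v = matmul(X2_star, b2)

print("b1* =", b1_star, " b1 + F b2 =", add(b1, matmul(F, b2)))
print("e*'e* =", sum_sq(e_star), " e'e + v'v =", sum_sq(e) + sum_sq(v))
```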


17.3. Residual Regression 


When we run the long regression of $y$ on $X_1$ and $X_2$ rather than a short 
regression of $y$ on $X_2$ alone, to get coefficients on $X_2$, it is natural to say 
that we are getting the effect of $X_2$ “after controlling for $X_1$,” or “after 
allowing for the effects of $X_1$,” or “after holding $X_1$ constant.” We can 
develop some algebra that gives content to that language. 

Consider the residual regression. Regress $y$ on $X_2^* = M_1 X_2$, the $n \times k_2$ 
matrix of residuals from the auxiliary regressions of $X_2$ on $X_1$. The 
coefficient vector will be 

$$c_2 = A_2^* y,$$

where 

$$A_2^* = (X_2^{*\prime}X_2^*)^{-1}X_2^{*\prime} = (X_2^{*\prime}X_2^*)^{-1}X_2'M_1.$$

Now 

$$M_1 y = M_1(X_1 b_1 + X_2 b_2 + e) = X_2^* b_2 + e,$$

$$X_2'X_2^* = X_2'M_1X_2 = X_2'M_1'M_1X_2 = X_2^{*\prime}X_2^*,$$

$$X_2'e = 0.$$

So 

$$c_2 = A_2^* y = b_2,$$

which is the $k_2 \times 1$ lower subvector of $b$, the coefficient vector in the 
long regression of $y$ on $X$. To restate the result, the $b_2$ subvector of $b$ 
is obtainable by a two-step procedure: 


(i) Regress $X_2$ on $X_1$ to get the residuals $X_2^*$, 
(ii) Regress $y$ on $X_2^*$ to get the coefficients $b_2$. 


Because $A_2^* M_1 = A_2^*$, it is also clear that we obtain the same $b_2$ by a 
double residual regression, regressing $y^* = M_1 y$ on $X_2^*$. With $b_2$ in hand, 
we can complete the calculation of the long-regression coefficient vector 
$b$, by regressing $y$ on $X_1$ alone, obtaining $b_1^* = A_1 y$, and then recovering 
$b_1$ as $b_1 = b_1^* - F b_2$, where $F = A_1 X_2$. 

The residual regression result, namely $b_2 = c_2$, gives content to the 
language used above. For $c_2$ indeed relates $y$ to $X_2$ “after controlling 
for the effects of $X_1$” in the sense that only $X_2^*$ (the component of $X_2$ 
that is not linearly related to $X_1$) is used to account for $y$. For example, 
in looking for the relation of earnings to experience in a regression that 
also includes education, we are in effect using not experience, but only 
the component of experience that is not linearly associated with 
education. 

The situation here is quite reminiscent of the distinction between 
partial and total derivatives in calculus. Indeed $b_1^* = b_1 + F b_2$ has the 
same pattern as $dy/dx_1 = \partial y/\partial x_1 + (dx_2/dx_1)(\partial y/\partial x_2)$. 
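
The two-step procedure can be traced through on a small example. The 
following Python sketch assumes hypothetical data and uses an 
illustrative helper, `ls`, that returns LS coefficients and residuals; 
none of these names come from the text. It carries out steps (i) and 
(ii), repeats the calculation as a double residual regression, and then 
recovers $b_1$ from the short-regression coefficients. 

```python
from fractions import Fraction

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inverse(A):
    """Exact Gauss-Jordan inverse; assumes A is square and nonsingular."""
    n = len(A)
    W = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if W[r][c] != 0)   # first nonzero pivot
        W[c], W[p] = W[p], W[c]
        W[c] = [v / W[c][c] for v in W[c]]
        for r in range(n):
            if r != c:
                W[r] = [v - W[r][c] * w for v, w in zip(W[r], W[c])]
    return [row[n:] for row in W]

def ls(X, Y):
    """Coefficients and residuals from the LS regression of (each column of) Y on X."""
    Xt = transpose(X)
    coef = matmul(inverse(matmul(Xt, X)), matmul(Xt, Y))
    return coef, sub(Y, matmul(X, coef))

# Hypothetical data: X1 = (intercept, x2), X2 = (x3).
X1 = [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6]]
X2 = [[0], [1], [0], [1], [0], [1]]
X = [r1 + r2 for r1, r2 in zip(X1, X2)]
y = [[3], [5], [4], [8], [7], [11]]

b, _ = ls(X, y)                       # long regression, for comparison

F, X2_star = ls(X1, X2)               # step (i): residuals of X2 on X1
c2, _ = ls(X2_star, y)                # step (ii): regress y on X2*

b1_star, y_star = ls(X1, y)           # short regression; y* = M1 y
c2_double, _ = ls(X2_star, y_star)    # double residual regression

b1_recovered = sub(b1_star, matmul(F, c2))   # b1 = b1* - F b2

print("long regression b  =", b)
print("residual regression c2 =", c2, " double residual =", c2_double)
print("recovered b1 =", b1_recovered)
```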


17.4. Applications of Residual Regression 


The residual regression results are remarkably useful for theory and 
practice. 

Trend Removal. A popular specification for economic time series takes 
the form $E(y_t) = \sum_{j=1}^{k} \beta_j x_{tj}$, where $t$ indexes the observations, 
$x_{t1} = 1$, $x_{t2} = t$, and $x_{t3}, \ldots, x_{tk}$ are conventional explanatory 
variables. Here $x_2$ allows for a linear trend in the expected value of the 
dependent variable. The question arises: should the trend term be 
included in the regression, or should the variables first be “detrended” 
and then used without the trend terms included? In the first volume of 
Econometrica, Frisch and Waugh (1933) concluded that it does not matter. 
The residual regression results apply. Let $X_1 = (x_1, x_2)$ and 
$X_2 = (x_3, \ldots, x_k)$. The coefficients on the latter set will be the same 
whether we include $X_1$ in the regression along with $X_2$, or alternatively 
first detrend $X_2$ (by calculating the residuals $X_2^*$ from the regression 
of $X_2$ on $X_1$) and then use only those detrended explanatory variables. 

Seasonal Adjustment. Some macroeconomic variables show a seasonal 
pattern. When tracking such a variable, it is useful for some purposes 
to “deseasonalize,” or “seasonally adjust,” the variable. Suppose that we 
have data on the variable $y$, quarter by quarter, for $m$ years, with $y_{th}$ 
being the value for year $t$, quarter $h$. Suppose that we are looking for a 
business-cycle pattern in this series, but we notice that first-quarter 
values are typically low and third-quarter values are typically high. It 
seems sensible to deseasonalize before judging, say, whether the series 
is now at a cyclical peak. A conventional way to deseasonalize is to 
calculate the seasonal means, $\bar{y}_1, \bar{y}_2, \bar{y}_3, \bar{y}_4$, say, and express each 
observation as a deviation from its seasonal mean: $y_{th}^* = y_{th} - \bar{y}_h$. 
These $y^*$'s form a seasonally adjusted series. (The grand mean $\bar{y}$ can be 
added back in to restore the level of the series.) The cyclical standing 
of the variable may be more apparent in the $y^*$ series than in the $y$ 
series. 

This calculation can be performed by regression. We have $n = 4m$ 
observations on $y$, arranged by quarters within years. Define the four 
“seasonal dummy variables”: 

$$x_1 = \begin{cases} 1 & \text{in quarter 1} \\ 0 & \text{otherwise} \end{cases} \qquad x_2 = \begin{cases} 1 & \text{in quarter 2} \\ 0 & \text{otherwise} \end{cases}$$

$$x_3 = \begin{cases} 1 & \text{in quarter 3} \\ 0 & \text{otherwise} \end{cases} \qquad x_4 = \begin{cases} 1 & \text{in quarter 4} \\ 0 & \text{otherwise} \end{cases}$$

Let $X_1 = (x_1, x_2, x_3, x_4)$. Then regress $y$ on $X_1$ to get coefficients 
$b_1^* = A_1 y$, and residuals $y^* = M_1 y$. These residuals form the seasonally 
adjusted series. (Again the grand mean can be added to restore the 
levels.) 

To verify the assertion, observe that in view of the arrangement of 
the data (seasons within years), $X_1 = (I, I, \ldots, I)'$, where each of the 
$I$'s is $4 \times 4$. Then $X_1'X_1 = mI$, and $X_1'y = m(\bar{y}_1, \bar{y}_2, \bar{y}_3, \bar{y}_4)'$. So 
the coefficient vector is $b_1^* = (\bar{y}_1, \bar{y}_2, \bar{y}_3, \bar{y}_4)'$, and the residuals 
are $y^* = M_1 y = \{y_{th} - \bar{y}_h\}$ as asserted. 
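
As a concrete illustration of seasonal adjustment by regression, the 
Python sketch below uses a hypothetical quarterly series covering 
$m = 2$ years. The numbers and helper functions are assumptions made for 
the example. It builds the dummy-variable matrix $X_1$, regresses $y$ on 
it, and compares the coefficients with the seasonal means. 

```python
from fractions import Fraction

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(A):
    """Exact Gauss-Jordan inverse; assumes A is square and nonsingular."""
    n = len(A)
    W = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if W[r][c] != 0)   # first nonzero pivot
        W[c], W[p] = W[p], W[c]
        W[c] = [v / W[c][c] for v in W[c]]
        for r in range(n):
            if r != c:
                W[r] = [v - W[r][c] * w for v, w in zip(W[r], W[c])]
    return [row[n:] for row in W]

# Hypothetical series: quarters within years, m = 2 years.
m = 2
y = [[10], [14], [18], [12],     # year 1, quarters 1-4
     [12], [16], [22], [14]]     # year 2, quarters 1-4

# X1 stacks m copies of the 4 x 4 identity: one dummy column per quarter.
X1 = [[int(q == h) for h in range(4)] for _ in range(m) for q in range(4)]

X1t = transpose(X1)
X1tX1 = matmul(X1t, X1)                               # should equal m * I
b1_star = matmul(inverse(X1tX1), matmul(X1t, y))      # seasonal means
y_star = [[yi[0] - fi[0]] for yi, fi in zip(y, matmul(X1, b1_star))]

# Restore the level of the series by adding back the grand mean.
grand_mean = Fraction(sum(v[0] for v in y), len(y))
adjusted = [[v[0] + grand_mean] for v in y_star]

print("seasonal means:", [c[0] for c in b1_star])
print("deviations from seasonal means:", [v[0] for v in y_star])
print("seasonally adjusted series:", [float(v[0]) for v in adjusted])
```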

Linear Regression with Seasonal Data. Suppose that we are interested in 
the regression relation between $y$ and a set of explanatory variables $X_2$. 
Ordinarily we would regress $y$ on the summer vector $s = (1, \ldots, 1)'$ 
and $X_2$ to get an intercept and a set of slopes. But if $y$ and $X_2$ have 
seasonal components in them, we might want to remove the seasonals 
first. We have already done this for $y$. To deseasonalize $X_2$, regress $X_2$ 
on the seasonal dummy variables $X_1$, getting coefficients $F = A_1 X_2$ and 
residuals $X_2^* = M_1 X_2$. Then regress the seasonally adjusted dependent 
variable $y^* = M_1 y$ on the seasonally adjusted explanatory variables $X_2^*$ 
to get the slopes $c_2$. Residual regression theory tells us that 

$$c_2 = A_2^* y^* = A_2^* y = b_2,$$

which are the coefficients on $X_2$ in the long regression of $y$ on $(X_1, X_2)$. 
Running the regression on seasonally adjusted data in effect allows for 
parallel shifts in the relationship of $y$ to $X_2$, that is, separate intercepts 
for each of the quarters. To recover those intercepts, use 
$b_1 = b_1^* - F b_2$. 

Deviations from Means. In simple regression, where there is only one 
explanatory variable along with the constant, $\hat{y}_i = a + bx_i$, the LS slope 
and intercept can be calculated as 

$$b = \Big[\sum_i (x_i - \bar{x})(y_i - \bar{y})\Big] \Big/ \Big[\sum_i (x_i - \bar{x})^2\Big], \qquad a = \bar{y} - b\bar{x}.$$

It is also true that 

$$b = \Big[\sum_i (x_i - \bar{x})\, y_i\Big] \Big/ \Big[\sum_i (x_i - \bar{x})^2\Big].$$

Deviations from the mean are residuals from regression on the summer 
vector, so these well-known formulas are applications of residual 
regression theory. 

Turn to multiple regression of $y$ on $X = (x_1, X_2)$, where $x_1$ is the 
summer vector. The fitted regression is 

$$\hat{y} = x_1 b_1 + X_2 b_2,$$

where $b_1$ is the intercept and $b_2$ is the slope vector. Here 

$$M_1 = I - x_1(x_1'x_1)^{-1}x_1' = I - (1/n)\, x_1 x_1'$$

is the idempotent matrix which, when multiplied into any column vector, 
produces deviations from the column mean. So $y^* = M_1 y$ and 
$X_2^* = M_1 X_2$ are the variables expressed as deviations from their 
respective means. Residual regression theory says that the slopes can be 
calculated as 

$$b_2 = (X_2^{*\prime}X_2^*)^{-1}X_2^{*\prime}y = (X_2^{*\prime}X_2^*)^{-1}X_2^{*\prime}y^*,$$

and the intercept recovered as 

$$b_1 = b_1^* - F b_2 = \bar{y} - \bar{x}_2' b_2,$$

where 

$$\bar{x}_2' = A_1 X_2 = (x_1'x_1)^{-1}x_1'X_2 = (1/n)\, x_1'X_2$$

is the row vector of means of the explanatory variables. Thus the 
familiar device that expresses variables as deviations about their means 
before running the regression carries over to the multiple regression 
case. 


17.5. Short and Residual Regressions in the Classical 
Regression Model 


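We have considered the short and residual regressions from an algebraic 
perspective. We now reconsider them in a statistical context. Suppose 
that the CR model applies, so that $E(y) = X\beta$, $V(y) = \sigma^2 I$, $X$ 
nonstochastic, $\mathrm{rank}(X) = k$. Partition as follows: 

$$X = (X_1, X_2), \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix},$$

where $X_1$ is $n \times k_1$, $X_2$ is $n \times k_2$, $\beta_1$ is $k_1 \times 1$, and $\beta_2$ is 
$k_2 \times 1$. We have 

$$E(y) = X_1\beta_1 + X_2\beta_2.$$

The long regression gives 

$$y = Xb + e = X_1 b_1 + X_2 b_2 + e,$$

and we know that 

$$E\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}, \qquad V(b) = \sigma^2 Q^{-1} = \sigma^2 \begin{pmatrix} Q^{11} & Q^{12} \\ Q^{21} & Q^{22} \end{pmatrix},$$

where the $Q$'s with superscripts denote submatrices in the inverse. 
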
Short-Regression Coefficients. Consider the regression of $y$ on $X_1$ alone. 
If the CR model applies to $E(y) = X_1\beta_1 + X_2\beta_2$, then in general 
$E(y) \ne X_1\beta_1$. Nevertheless we can evaluate the short regression results. 
The coefficient vector is $b_1^* = A_1 y$. Apply R2 (Section 15.1) to obtain 

$$E(b_1^*) = A_1E(y) = A_1(X_1\beta_1 + X_2\beta_2) = \beta_1 + F\beta_2, \tag{17.7}$$

$$V(b_1^*) = A_1V(y)A_1' = \sigma^2 A_1A_1' = \sigma^2 (X_1'X_1)^{-1}, \tag{17.8}$$

using $A_1X_1 = I$, $A_1X_2 = F$, and $A_1A_1' = (X_1'X_1)^{-1}$. 

From Eq. (17.7), we conclude that in general $b_1^*$ is a biased estimator 
of $\beta_1$, a result known as omitted-variable bias. The exceptional cases are: 

(a) Irrelevant Omitted Variables. If $\beta_2 = 0$, then the CR model does 
apply to the short regression $E(y) = X_1\beta_1$. Here $b_1^*$ and $b_1$ differ in any 
sample but have the same expectation. 

(b) Orthogonal Explanatory Variables. If $F = O$, then $b_1^*$ and $b_1$ coincide 
in every sample, even though the CR model does not apply to the short 
regression. 

From Eq. (17.8) it follows that $V(b_1) \ge V(b_1^*)$. Rewrite Eq. (17.4) as 
$b_1 = b_1^* - F b_2$. Now 

$$C(b_1^*, b_2) = A_1V(y)A_2^{*\prime} = \sigma^2 A_1A_2^{*\prime} = O,$$

because $M_1X_1 = O$ implies $A_2^*A_1' = O$. Consequently, 

$$V(b_1) = V(b_1^*) + FV(b_2)F'. \tag{17.9}$$

Because $FV(b_2)F'$ is nonnegative definite, we have $V(b_1) \ge V(b_1^*)$. 
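
Equations (17.7) and (17.9) involve only $X$ and the parameters, so they 
can be evaluated without drawing any sample. The Python sketch below 
does this for a hypothetical design matrix and a hypothetical parameter 
vector $\beta = (1, 2, 3)'$, with $\sigma^2$ set to one for convenience. The 
helper functions and names are illustrative assumptions. 

```python
from fractions import Fraction

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inverse(A):
    """Exact Gauss-Jordan inverse; assumes A is square and nonsingular."""
    n = len(A)
    W = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if W[r][c] != 0)   # first nonzero pivot
        W[c], W[p] = W[p], W[c]
        W[c] = [v / W[c][c] for v in W[c]]
        for r in range(n):
            if r != c:
                W[r] = [v - W[r][c] * w for v, w in zip(W[r], W[c])]
    return [row[n:] for row in W]

# Hypothetical nonstochastic design: X1 = (intercept, x2), X2 = (x3).
X1 = [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6]]
X2 = [[0], [1], [0], [1], [0], [1]]
X = [r1 + r2 for r1, r2 in zip(X1, X2)]
k1 = 2
beta = [[1], [2], [3]]                        # hypothetical true parameters

A1 = matmul(inverse(matmul(transpose(X1), X1)), transpose(X1))
F = matmul(A1, X2)                            # auxiliary regression coefficients

# Eq. (17.7): E(b1*) = A1 E(y) = beta_1 + F beta_2.
E_b1_star = matmul(A1, matmul(X, beta))
bias = matmul(F, beta[k1:])                   # omitted-variable bias F beta_2

# Variance matrices with sigma^2 = 1.
Q_inv = inverse(matmul(transpose(X), X))
V_b1 = [row[:k1] for row in Q_inv[:k1]]       # northwest block Q^{11}
V_b2 = [row[k1:] for row in Q_inv[k1:]]       # southeast block Q^{22}
V_b1_star = matmul(A1, transpose(A1))         # Eq. (17.8): (X1'X1)^{-1}
gap = matmul(matmul(F, V_b2), transpose(F))   # F V(b2) F'

print("E(b1*) =", E_b1_star, " bias =", bias)
print("V(b1) - V(b1*) has diagonal", [gap[i][i] for i in range(k1)])
```

In this example the short-regression coefficients are biased for 
$\beta_1$, while their variances are smaller than those of the 
long-regression coefficients, which is the trade-off at issue. 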

Observe that the variance matrix of $b_1^*$ does not depend on the true 
value of $\beta_2$, although the expectation of $b_1^*$ does. Whether or not 
$\beta_2 = 0$, the short-regression coefficient estimator has the smaller 
variance, in the sense that $V(b_1) - V(b_1^*)$ is nonnegative definite. This 
suggests that in practice there may be a bias-variance trade-off between 
short and long regressions when the target of interest is $\beta_1$. Observe 
also that the short-regression coefficient vector $b_1^*$ is an unbiased 
estimator of a certain mixture of parameters: $E(b_1^*) = \beta_1^*$, where 
$\beta_1^* = \beta_1 + F\beta_2$. We return to these two observations in Chapter 24. 

Short-Regression Residuals. Next, turn to the short-regression residuals, 
$e^* = M_1 y$. We have 

$$E(e^*) = M_1E(y) = M_1(X_1\beta_1 + X_2\beta_2) = X_2^*\beta_2,$$

$$V(e^*) = M_1V(y)M_1' = \sigma^2 M_1,$$

using $M_1X_1 = O$, $M_1X_2 = X_2^*$, and $M_1M_1' = M_1$. So in general the 
short-regression residual vector has a nonzero expectation. Because 
$\mathrm{rank}(X_2^*) = k_2$, the only exceptional case is $\beta_2 = 0$. For the sum of 
squared residuals, we have $e^{*\prime}e^* = y'M_1y$, and R5 (Section 15.1) applies 
with $M_1$ taking the role of $T$: 

$$E(e^{*\prime}e^*) = \mathrm{tr}[M_1V(y)] + E(y)'M_1E(y).$$

But $M_1V(y) = \sigma^2 M_1$, $M_1E(y) = X_2^*\beta_2$, and $\mathrm{tr}(M_1) = n - k_1$, so 

$$E(e^{*\prime}e^*) = \sigma^2\,\mathrm{tr}(M_1) + \beta_2'X_2^{*\prime}X_2^*\beta_2 = \sigma^2(n - k_1) + \beta_2'X_2^{*\prime}X_2^*\beta_2.$$

And $E(e'e) = \sigma^2(n - k)$, so 

$$E(e^{*\prime}e^*) - E(e'e) = \sigma^2 k_2 + \beta_2'X_2^{*\prime}X_2^*\beta_2.$$

On the right-hand side, the first term is positive and the second term 
is nonnegative, and is zero iff $\beta_2 = 0$. 

We conclude that omission of explanatory variables leads to an 
increase in the expected sum of squared residuals, even if $\beta_2 = 0$. The 
increase in expectation should come as no surprise because we know 
from Eq. (17.6) that $e^{*\prime}e^* - e'e = b_2'X_2^{*\prime}X_2^*b_2 \ge 0$ in every sample, 
with equality iff $b_2 = 0$. Omitting $X_2$ never reduces the sum of squared 
residuals, and almost always increases it, so on average it must increase 
it. 

Residual Regression. Finally, consider the residual regression of y on $X_2^* = M_1X_2$. The coefficient vector is $c_2 = A_2^*y = b_2$. Apply R2 to find

$$E(b_2) = A_2^*E(y) = A_2^*(X_1\beta_1 + X_2\beta_2) = \beta_2,$$
$$V(b_2) = A_2^*V(y)A_2^{*\prime} = \sigma^2A_2^*A_2^{*\prime} = \sigma^2(X_2^{*\prime}X_2^*)^{-1} = \sigma^2(Q_{22}^*)^{-1},$$

using

$$A_2^*X_1 = O, \quad A_2^*X_2 = I, \quad A_2^*A_2^{*\prime} = (X_2^{*\prime}X_2^*)^{-1},$$

and defining

$$Q_{22}^* = X_2^{*\prime}X_2^* = X_2'M_1X_2 = X_2'X_2 - X_2'X_1(X_1'X_1)^{-1}X_1'X_2.$$

But $b_2$ is the lower $k_2 \times 1$ subvector of b, so $V(b_2)$ is also given by the southeast ($k_2 \times k_2$) block of the matrix $V(b) = \sigma^2Q^{-1}$. Thus we have proved an algebraic result:

SUBMATRIX OF INVERSE THEOREM. Suppose that a positive definite matrix Q and its inverse $Q^{-1}$ are partitioned conformably as

$$Q = \begin{pmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{pmatrix}, \qquad Q^{-1} = \begin{pmatrix} Q^{11} & Q^{12} \\ Q^{21} & Q^{22} \end{pmatrix},$$

where the diagonal blocks are square. Then $Q^{22} = (Q_{22}^*)^{-1}$, where

$$Q_{22}^* = Q_{22} - Q_{21}(Q_{11})^{-1}Q_{12}.$$
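
The theorem is easy to verify on a small example. The sketch below, in Python, assumes a hypothetical 3 × 3 positive definite Q whose leading block $Q_{11}$ is 2 × 2, so that $Q_{22}$ and $Q^{22}$ are scalars; the matrix entries and helper names are illustrative only.

```python
from fractions import Fraction as Fr

# Hypothetical positive definite Q, partitioned with Q11 as the leading 2 x 2 block.
Q = [[Fr(4), Fr(1), Fr(2)],
     [Fr(1), Fr(3), Fr(1)],
     [Fr(2), Fr(1), Fr(5)]]

def det2(a, b, c, d):
    """Determinant of the 2 x 2 matrix [[a, b], [c, d]]."""
    return a * d - b * c

def det3(m):
    """Determinant of a 3 x 3 matrix by expansion along the first row."""
    return (m[0][0] * det2(m[1][1], m[1][2], m[2][1], m[2][2])
            - m[0][1] * det2(m[1][0], m[1][2], m[2][0], m[2][2])
            + m[0][2] * det2(m[1][0], m[1][1], m[2][0], m[2][1]))

# Southeast element of Q^{-1} by the cofactor formula: det(Q11) / det(Q).
d11 = det2(Q[0][0], Q[0][1], Q[1][0], Q[1][1])
q22_upper = d11 / det3(Q)

# Q22* = Q22 - Q21 (Q11)^{-1} Q12, with (Q11)^{-1} written out for the 2 x 2 case.
inv11 = [[Q[1][1] / d11, -Q[0][1] / d11],
         [-Q[1][0] / d11, Q[0][0] / d11]]
q12 = [Q[0][2], Q[1][2]]
q21 = [Q[2][0], Q[2][1]]
quad = sum(q21[i] * inv11[i][j] * q12[j] for i in range(2) for j in range(2))
q22_star = Q[2][2] - quad

print(q22_upper, 1 / q22_star)
```

The two printed values coincide, as the theorem requires.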


Exercises 


17.1 Using national time series data, I run the linear regression of $y_1$ (consumption) on $x_1$ (the summer vector) and $x_2$ (disposable income), obtaining $b_1$, $\hat{y}_1$, and $e_1$ as the vectors of coefficients, fitted values, and residuals. I also run the linear regression of $y_2$ (saving) on the same two explanatory variables, obtaining $b_2$, $\hat{y}_2$, and $e_2$. In the data, of course, consumption + saving = disposable income, so $y_1 + y_2 = x_2$.


(a) Use the A, N, M matrices to show as concisely as possible that

$$b_1 + b_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \qquad \hat{y}_1 + \hat{y}_2 = x_2, \qquad e_1 + e_2 = 0.$$


(b) Show that the sum of squared residuals for the savings regression is identical to the sum of squared residuals for the consumption regression.

(c) True or false? (Explain briefly.) The coefficients of determination, the $R^2$'s, are the same for the two regressions.


17.2 Let Z = XT, where X is an n × k matrix with rank k, and T is a k × k matrix with rank k. Let b and e be the LS coefficient and residual vectors for regression of an n × 1 vector y on X. Show that regression of y on Z gives coefficient vector $c = T^{-1}b$ and residual vector $e^* = e$.


17.3 Suppose that the n × k matrix $X = (X_1, X_2)$ has full column rank. Let $X_2^* = M_1X_2$ be the $n \times k_2$ matrix of residuals from the auxiliary regression of $X_2$ on $X_1$. Show that rank$(X_2^*) = k_2$. Hint: Use proof by contradiction.


17.4 Table A.3 contains a cross-section data set on n = 100 family heads from the 1963 Survey of Consumer Finances, as taken from Mirer (1988, pp. 18-22). The variables are:


V1 = Identification number (1, ..., 100)
V2 = Family size
V3 = Education (years)
V4 = Age (years)
V5 = Experience (years)
V6 = Months worked
V7 = Race (coded as 1 = white, 2 = black)
V8 = Region (coded as 1 = northeast, 2 = northcentral, 3 = south, 4 = west)
V9 = Earnings ($1000)
V10 = Income ($1000)
V11 = Wealth ($1000)
V12 = Savings ($1000)


Experience is defined as age − education − 5. We will use this data set, presumed to be available as an ASCII file labeled SCF, frequently. Run these four linear earnings regressions:


(a) y on $x_1, x_2$.

(b) y on $x_1, x_2, x_3, x_4$.

(c) y on $x_1, x_2, x_3, x_4, x_5$.

(d) y on $x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8$.


Here y = earnings; $x_1 = 1$; $x_2$ = education; $x_3$ = experience; $x_4 = x_3^2$; $x_5$ = 1 if black, 0 if white; $x_6$ = 1 if northcentral, 0 otherwise; $x_7$ = 1 if south, 0 otherwise; $x_8$ = 1 if west, 0 otherwise.

For each regression in turn, assume that the CR model holds. Report coefficient estimates, their standard errors, the estimate of $\sigma$, and the $R^2$. Also report the means and standard deviations of all variables used in (d).


GAUSS Hints:


(1) n = 100; load D[n,12] = scf;  [This will read in the data set as a 100 × 12 matrix called D.]

(2) y = D[.,9];  [The vector y is equal to the 9th column of D.]

(3) If x is an n × 1 vector, then z = ... is the n × 1 vector of squares of the elements of x.

(4) If x and w are n × 1 vectors, then z = x .== w is the n × 1 vector with 1's where the elements of x and w are equal and 0's elsewhere.

(5) ... is the column vector ... whose elements are the diagonal el...


17.5 Comment on any aspect of the regression results in Exercise 17.4 that puzzles you.


17.6 Continuing Exercise 17.4, let $X_1 = (x_1, x_2, x_3, x_4)$ and $X_2 = (x_5, x_6, x_7, x_8)$. Write and run a program to:


(a) Calculate the auxiliary regressions of $X_2$ on $X_1$, obtaining the coefficient matrix $F = (X_1'X_1)^{-1}X_1'X_2$ and the residual matrix $X_2^* = X_2 - X_1F$.

(b) Calculate the residual regression of y on $X_2^*$, obtaining the coefficient vector $c_2$.

(c) Use those results, along with those found in Exercise 17.4, to verify numerically the relations $b_1^* = b_1 + Fb_2$ and $c_2 = b_2$.


18 Multivariate Normal Distribution 


18.1. Introduction 


For the classical regression model we now have considerable information on the sampling distributions of the LS statistics b and e'e. That information, which concerns expectations, variances, and covariances, suffices to justify the use of certain sample statistics as estimates of the population parameters, and to provide estimates of their precision as well. We have seen why the sample LS coefficient $b_j$ serves as an estimate of $\beta_j$, and why its standard error, $\hat\sigma_{b_j} = \hat\sigma\sqrt{q^{jj}}$, serves as an estimate of its standard deviation, $\sqrt{V(b_j)} = \sigma_{b_j} = \sigma\sqrt{q^{jj}}$.

But we need more information to undertake further exact statistical 
inference for the regression parameters, that is, to construct exact con- 
fidence intervals and to test hypotheses at exact significance levels. Fol- 
lowing the traditional practice, we will specialize the CR model to the 
case where the y’s are normally distributed, and then deduce the exact 
sampling distributions of the LS statistics. As a preliminary, in this 
chapter we set aside the regression context in order to develop some 
general theory on the multivariate normal distribution. 


18.2. Multivariate Normality 


Suppose that the joint pdf of the n × 1 random vector y is

$$f(y) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\exp(-w/2),$$

where

$w = (y - \mu)'\Sigma^{-1}(y - \mu)$ is a scalar,

$\mu$ is an n × 1 parameter vector,

$\Sigma$ is an n × n positive definite parameter matrix,

$|\Sigma|^{-1/2} = 1/\sqrt{\det(\Sigma)}$.


Then we say that the distribution of y is multivariate normal, or multinormal, with parameters $\mu$ and $\Sigma$, and write $y \sim N(\mu, \Sigma)$.

Consider a couple of special cases. If n = 1, then $y = y$, $\mu = \mu$, $\Sigma = \sigma^2$, $w = z^2$, with $z = (y - \mu)/\sigma$. So f(y) is the familiar univariate normal density $f(y) = \exp(-z^2/2)/\sqrt{2\pi\sigma^2}$.

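If n = 2, then

$$y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}.$$

So

$$|\Sigma| = \sigma_{11}\sigma_{22} - \sigma_{12}^2 = \sigma_1^2\sigma_2^2(1 - \rho^2),$$

where

$$\rho = \sigma_{12}/(\sigma_1\sigma_2), \qquad \sigma_1^2 = \sigma_{11}, \qquad \sigma_2^2 = \sigma_{22},$$

and

$$\Sigma^{-1} = \frac{1}{|\Sigma|}\begin{pmatrix} \sigma_{22} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{11} \end{pmatrix}.$$

Also

$$w = (z_1^2 + z_2^2 - 2\rho z_1z_2)/(1 - \rho^2)$$

with $z_1 = (y_1 - \mu_1)/\sigma_1$, $z_2 = (y_2 - \mu_2)/\sigma_2$. So $f(y_1, y_2)$ is the bivariate normal density of Section 7.3.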

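Returning to the general multivariate case, partition the n × 1 vector y into the $n_1 \times 1$ vector $y_1$ and the $n_2 \times 1$ vector $y_2$, and correspondingly partition $\mu$ and $\Sigma$:

$$y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$

We can now state these generalizations of the properties shown for the bivariate normal in Section 7.4.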


If $y \sim N(\mu, \Sigma)$, then:

P1. The expectation vector and variance matrix are $E(y) = \mu$ and $V(y) = \Sigma$, thus justifying the symbols used for the parameters.


P2. The marginal distribution of $y_1$ is multinormal:

$$y_1 \sim N(\mu_1, \Sigma_{11}).$$


P3. The conditional distribution of $y_2$ given $y_1$ is multinormal:

$$y_2 \mid y_1 \sim N(\mu_2^*, \Sigma_{22}^*),$$

where

$$\mu_2^* = E(y_2 \mid y_1) = \alpha + B'y_1,$$
$$B = (\Sigma_{11})^{-1}\Sigma_{12},$$
$$\alpha = \mu_2 - B'\mu_1,$$
$$\Sigma_{22}^* = V(y_2 \mid y_1) = \Sigma_{22} - B'\Sigma_{11}B.$$

Observe that the CEF vector is linear in $y_1$, and that the conditional variance matrix is constant across $y_1$.


P4. Uncorrelatedness implies independence: If $\Sigma_{12} = O$, then $y_1$ and $y_2$ are independent random vectors.


Proof. If $\Sigma_{12} = O$, then $B = O$, so $\mu_2^* = \alpha = \mu_2$, and $\Sigma_{22}^* = \Sigma_{22}$. Consequently, $y_2 \mid y_1 \sim N(\mu_2, \Sigma_{22})$ for all $y_1$. These conditional distributions are all the same (they all coincide with the marginal distribution $y_2 \sim N(\mu_2, \Sigma_{22})$), so the vectors are stochastically independent. ■


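Of course, the roles of $y_1$ and $y_2$ can be reversed throughout.
Consider a special case. Take the bivariate normal distribution, by setting $n_1 = n_2 = 1$. Then P2 and P3 specialize to

$$y_1 \sim N(\mu_1, \sigma_1^2), \qquad y_2 \mid y_1 \sim N(\alpha + \beta y_1, \sigma^2),$$

where

$$\beta = \sigma_{12}/\sigma_1^2, \qquad \alpha = \mu_2 - \beta\mu_1, \qquad \sigma^2 = \sigma_2^2 - \beta^2\sigma_1^2,$$

as in Section 7.4.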


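For a second special case, set $n_1 = n - 1$, $n_2 = 1$. Here $y_2$ is scalar, while $y_1$ is a vector. The variables and parameters partition as

$$y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \sigma_{12} \\ \sigma_{12}' & \sigma_{22} \end{pmatrix},$$

and we have

$$y_2 \mid y_1 \sim N(\alpha + \beta'y_1, \sigma^2),$$

where

$$\beta = (\Sigma_{11})^{-1}\sigma_{12}, \qquad \alpha = \mu_2 - \beta'\mu_1, \qquad \sigma^2 = \sigma_{22} - \beta'\Sigma_{11}\beta.$$

Evidently this case specifies a population that could support the CR model: the conditional expectation function $E(y_2 \mid y_1) = \alpha + \beta'y_1$ is linear, and the conditional variance $V(y_2 \mid y_1) = \sigma^2$ is constant.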
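
As an illustration of the formulas for this second special case, here is a minimal Python sketch with $n_1 = 2$ and $n_2 = 1$. The parameter values are hypothetical and chosen only to keep the arithmetic exact; the function name `cef` is likewise illustrative.

```python
from fractions import Fraction as Fr

# Hypothetical parameters of a trivariate normal, partitioned as (y1 vector, y2 scalar).
mu1 = [Fr(1), Fr(0)]
mu2 = Fr(2)
S11 = [[Fr(2), Fr(1)], [Fr(1), Fr(2)]]   # variance matrix of y1
s12 = [Fr(1), Fr(1)]                     # covariances of y1 with y2
s22 = Fr(3)                              # variance of y2

# Inverse of the 2 x 2 block Sigma_11.
d = S11[0][0] * S11[1][1] - S11[0][1] * S11[1][0]
S11_inv = [[S11[1][1] / d, -S11[0][1] / d],
           [-S11[1][0] / d, S11[0][0] / d]]

# beta = (Sigma_11)^{-1} sigma_12 and alpha = mu_2 - beta' mu_1.
beta = [sum(S11_inv[i][j] * s12[j] for j in range(2)) for i in range(2)]
alpha = mu2 - sum(b * m for b, m in zip(beta, mu1))

# sigma^2 = sigma_22 - beta' Sigma_11 beta; here beta' Sigma_11 beta = beta' sigma_12.
cond_var = s22 - sum(b * s for b, s in zip(beta, s12))

def cef(y1):
    """Conditional expectation E(y2 | y1) = alpha + beta' y1."""
    return alpha + sum(b * v for b, v in zip(beta, y1))

print(beta, alpha, cond_var, cef(mu1))
```

Evaluating the CEF at $y_1 = \mu_1$ returns $\mu_2$, and the conditional variance is smaller than the marginal variance $\sigma_{22}$.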
Concluding the properties of the $N(\mu, \Sigma)$ distribution, we have:


P5. Linear functions of a multinormal vector are multinormal: If $z = g + Hy$, where g and H are nonrandom, and H has full row rank, then $z \sim N(g + H\mu, H\Sigma H')$.


As in Section 7.4, the full-row-rank condition is required to rule out degeneracies, by ensuring that V(z) is nonsingular, a prerequisite for multinormality. To see the problem, suppose that $y = (y_1, y_2)'$, and consider z = Hy, with

$$H = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.$$

Then $z_1 = y_1 + y_2 = z_2$. So $z_1$ and $z_2$, being linear functions of the bivariate normal vector y, are each univariate normal. But with $z_1 = z_2$, their joint density lies entirely over the 45° line, rather than having the characteristic bell shape of the bivariate normal. Also, if y is n × 1 and H is m × n with m > n, then H cannot have full row rank, so z = Hy will not be multinormal. Nevertheless, any subvector of z that is expressible as a full-row-rank linear function of y will be multinormal. Some texts speak of a degenerate multinormal distribution when rank(H) < m.


18.3. Functions of a Standard Normal Vector


If the n × 1 random vector z is distributed $N(0, I)$, then we say that z is a standard normal vector. In that event, $z_1, \ldots, z_n$ are independent standard normal variables. The multinormal pdf specializes to

$$f(z) = (2\pi)^{-n/2}\exp(-z'z/2) = \prod_{i=1}^{n}\left[\exp(-z_i^2/2)/\sqrt{2\pi}\right],$$

which is indeed the product of n standard normal densities.
We restate and extend the theory of Section 8.5 for functions of independent standard normal variables:


(18.1) If $w = z'z$, where the n × 1 vector $z \sim N(0, I)$, then $w \sim \chi^2(n)$.


That is, the sum of squares of n independent standard normal variables 
has the chi-square distribution with parameter n. 


(18.2) If $v = (w_1/m)/(w_2/n)$, where $w_1 \sim \chi^2(m)$ and $w_2 \sim \chi^2(n)$ are independent, then $v \sim F(m, n)$.


That is, the ratio of two independent chi-square variables, each divided by its degrees of freedom, has the Snedecor F distribution, with numerator and denominator degrees of freedom equal to those of the respective chi-squares. The cdf of this distribution is tabulated in many textbooks.


(18.3) If $u = z/\sqrt{w/n}$, where $z \sim N(0, 1)$ and $w \sim \chi^2(n)$ are independent, then $u \sim t(n)$.


That is, the ratio of a standard normal variable to the square root of an independent chi-square variable divided by its degrees of freedom, has the Student's t-distribution. The parameter is the same as the parameter of the chi-square. Observe that if $u \sim t(n)$, then $u^2 \sim F(1, n)$, which parallels the result that if $z \sim N(0, 1)$, then $z^2 \sim \chi^2(1)$.

All three arguments reverse. For example, if $w \sim \chi^2(n)$, then w is expressible as the sum of squares of n independent standard normal variables.

Consider a sequence of random variables indexed by n. Two convenient asymptotic (in n) results follow:


(18.4) If $v \sim F(m, n)$, then $mv \xrightarrow{D} \chi^2(m)$.

Proof. Write $v = (w_1/m)/(w_2/n)$, so $mv = w_1/(w_2/n)$. Since $w_2/n = (1/n)\sum_i z_i^2$ can be interpreted as a sample mean in random sampling (sample size n) on the random variable $z^2$, whose expectation is 1, the Law of Large Numbers says $w_2/n \xrightarrow{P} 1$. So by S4 (Section 9.5), the variable mv has the same limiting distribution as $w_1/1$, namely $\chi^2(m)$. ■


(18.5) If $u \sim t(n)$, then $u \xrightarrow{D} N(0, 1)$.


Proof. Write $u = z/\sqrt{w/n}$. Since $w/n \xrightarrow{P} 1$, it follows that $\sqrt{w/n} \xrightarrow{P} \sqrt{1} = 1$. So u has the same limiting distribution as z/1, namely N(0, 1). ■


Consequently, if n is large, then one can rely on the chi-square and normal tables for approximate probabilities, rather than referring to the Snedecor F and Student's t tables.
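
The convergence in (18.5) can be seen in a small simulation. The sketch below, in Python, builds t(n) draws directly from the definition in (18.3) and records how often |u| exceeds the normal critical value 1.96. The seed, the number of draws, and the two values of n are arbitrary choices made for illustration.

```python
import math
import random

random.seed(12345)  # arbitrary seed, fixed so that the run is reproducible

def t_draw(n):
    """One draw of u = z / sqrt(w/n), with z ~ N(0,1) and w ~ chi-square(n) independent."""
    z = random.gauss(0.0, 1.0)
    w = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))
    return z / math.sqrt(w / n)

def tail_share(n, draws=4000, cutoff=1.96):
    """Share of simulated |u| values that exceed the normal cutoff."""
    hits = sum(abs(t_draw(n)) > cutoff for _ in range(draws))
    return hits / draws

share_small = tail_share(5)     # t(5): noticeably fatter tails than the normal
share_large = tail_share(200)   # t(200): close to the normal tail share of 0.05
print(share_small, share_large)
```

With few degrees of freedom the tail share is well above 0.05; with many it is close to the normal value, which is why the normal table serves as an approximation when n is large.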


18.4. Quadratic Forms in Normal Vectors 


We now establish the distributions of certain functions of a general 
multinormal vector, by reducing them to the functions of a standard 
normal vector introduced above. 


Q1. Suppose that the n × 1 vector $y \sim N(\mu, \Sigma)$. Let $w = (y - \mu)'\Sigma^{-1}(y - \mu)$. Then $w \sim \chi^2(n)$.


Proof. It suffices to show that $w = z'z$, where the n × 1 vector z is distributed N(0, I). The steps follow.


(i) Since $\Sigma$ is positive definite, we can write $\Sigma = C\Lambda C'$, where C is orthonormal (that is, $CC' = I = C'C$) and $\Lambda$ is diagonal with all diagonal elements positive. The diagonal elements of $\Lambda$ are the characteristic roots (eigenvalues) of $\Sigma$, and the columns of C are the corresponding characteristic vectors (eigenvectors) of $\Sigma$.

(ii) Let $\Lambda^*$ be the diagonal matrix whose diagonal elements are the reciprocal square roots of the corresponding diagonal elements of $\Lambda$.

(iii) Let $H = C\Lambda^*C'$. Then $H' = H$, $H'H = C\Lambda^{-1}C' = \Sigma^{-1}$, and $H\Sigma H' = I$.

(iv) Let $\varepsilon = y - \mu$. Then $\varepsilon \sim N(0, \Sigma)$.

(v) Let $z = H\varepsilon$. Then $z \sim N(0, I)$.

(vi) $w = \varepsilon'\Sigma^{-1}\varepsilon = \varepsilon'H'H\varepsilon = (H\varepsilon)'(H\varepsilon) = z'z$. ■


Q2. Suppose that the n × 1 vector $u \sim N(0, I)$. Let M be a nonrandom n × n idempotent matrix with rank(M) = r ≤ n. Let $w = u'Mu$. Then $w \sim \chi^2(r)$.


Proof. It suffices to show that $w = z_1'z_1$, where the r × 1 vector $z_1$ is distributed N(0, I). The steps follow.


(i) Since M is symmetric, $M = C\Lambda C'$, where C is orthonormal and $\Lambda$ is diagonal.

(ii) Since M is idempotent, its characteristic roots are either zeroes or ones. Since its rank is r, there are r unit roots and n − r zero roots. These roots are displayed on the diagonal of $\Lambda$, which without loss of generality can be arranged as

$$\Lambda = \begin{pmatrix} I_r & O_{r\times(n-r)} \\ O_{(n-r)\times r} & O_{(n-r)\times(n-r)} \end{pmatrix}.$$

(iii) Partition C correspondingly as $C = (C_1, C_2)$, where $C_1$ is n × r, and $C_2$ is n × (n − r).

(iv) Because C is orthonormal, we have

$$CC' = (C_1, C_2)\begin{pmatrix} C_1' \\ C_2' \end{pmatrix} = C_1C_1' + C_2C_2' = I,$$

$$C'C = \begin{pmatrix} C_1' \\ C_2' \end{pmatrix}(C_1, C_2) = \begin{pmatrix} C_1'C_1 & C_1'C_2 \\ C_2'C_1 & C_2'C_2 \end{pmatrix} = \begin{pmatrix} I & O \\ O & I \end{pmatrix}.$$

(v) So

$$C\Lambda = (C_1, C_2)\begin{pmatrix} I & O \\ O & O \end{pmatrix} = (C_1, O),$$

$$M = C\Lambda C' = (C_1, O)\begin{pmatrix} C_1' \\ C_2' \end{pmatrix} = C_1C_1'.$$

(vi) Because the r × n matrix $C_1'$ is nonrandom with rank r, the r × 1 random vector $z_1 = C_1'u$ is multinormal with

$$E(z_1) = C_1'E(u) = C_1'0 = 0,$$
$$V(z_1) = C_1'V(u)C_1 = C_1'IC_1 = C_1'C_1 = I.$$

That is, $z_1 \sim N(0, I)$.


(vii) $w = u'Mu = u'(C_1C_1')u = (u'C_1)(C_1'u) = z_1'z_1$. ■


Q3. Suppose that the n × 1 vector $u \sim N(0, I)$. Let M be a nonrandom n × n idempotent matrix with rank(M) = r ≤ n. Let L be a nonrandom matrix such that LM = O. Let $t_1 = Mu$ and $t_2 = Lu$. Then $t_1$ and $t_2$ are independent random vectors.


Proof. It suffices to show that $t_1$ and $t_2$ are respectively functions of two independent random vectors. The steps follow.


(i) Using the construction of Q2, again let $z_1 = C_1'u$, and also let $z_2 = C_2'u$. The n × 1 random vector $z = (z_1', z_2')' = C'u$ is standard normal, with $z_1$ and $z_2$ being independent.

(ii) Now $M = C_1C_1'$, so $t_1 = Mu = (C_1C_1')u = C_1(C_1'u) = C_1z_1$.

(iii) Let $N = I - M = C_2C_2'$. Then $LN = L(I - M) = L - LM = L$.

(iv) So $t_2 = Lu = (LN)u = L(C_2C_2')u = LC_2(C_2'u) = LC_2z_2$.

(v) Thus $t_1$ is a function of $z_1$, and $t_2$ is a function of $z_2$. ■
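
A familiar instance of Q2 and Q3 is the matrix that takes deviations from the sample mean. The Python sketch below assumes, for illustration only, that M = I − (1/n)ιι' with ι a vector of ones (so that u'Mu is the sum of squared deviations) and that L = (1/n)ι' (so that Lu is the sample mean). It checks the algebraic conditions and then simulates; the seed and sample sizes are arbitrary.

```python
import random

n = 4
# M = I - (1/n) * ones * ones' produces deviations from the sample mean.
M = [[(1.0 if i == j else 0.0) - 1.0 / n for j in range(n)] for i in range(n)]
# L = (1/n) * ones' produces the sample mean.
L = [1.0 / n] * n

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

MM = matmul(M, M)
idempotent = all(abs(MM[i][j] - M[i][j]) < 1e-12 for i in range(n) for j in range(n))
rank = round(sum(M[i][i] for i in range(n)))   # for an idempotent matrix, rank = trace
LM = matmul([L], M)[0]                         # should be a row of zeros

random.seed(2024)  # arbitrary seed
draws = 20000
sum_w = sum_m = sum_mw = 0.0
for _ in range(draws):
    u = [random.gauss(0.0, 1.0) for _ in range(n)]
    ubar = sum(u) / n                          # t2 = Lu
    w = sum((ui - ubar) ** 2 for ui in u)      # w = u'Mu
    sum_w += w
    sum_m += ubar
    sum_mw += ubar * w

mean_w = sum_w / draws                                   # near r = n - 1 under Q2
cov_mw = sum_mw / draws - (sum_m / draws) * mean_w       # near 0 under Q3
print(idempotent, rank, mean_w, cov_mw)
```

The simulated mean of u'Mu is close to the rank n − 1, the expectation of a χ²(n − 1) variable, and the sample covariance between the mean and the sum of squared deviations is close to zero, as independence implies.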


Exercises 


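18.1 Suppose that $y \sim N(\mu, \Sigma)$, with

$$\mu = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 2 & -1 & 1 \\ -1 & 5 & 1 \\ 1 & 1 & 3 \end{pmatrix}.$$


(a) Calculate $E(y_3 \mid y_1, y_2)$ and $V(y_3 \mid y_1, y_2)$.

(b) Find the best prediction of $y_3$ given that $y_1 = 1 = y_2$.

(c) Calculate $E(y_3 \mid y_1)$ and $V(y_3 \mid y_1)$.

(d) Find the best prediction of $y_3$ given that $y_1 = 1$.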


(e) Find $\Pr(-1 \le y_1 \le 2)$.
18.2 For the $\Sigma$ matrix in Exercise 18.1:


(a) Find the characteristic roots $\lambda_1, \lambda_2, \lambda_3$, and a corresponding set of orthonormal characteristic vectors $c_1, c_2, c_3$.

(b) Verify that $C\Lambda C' = \Sigma$.

GAUSS Hints

(1) If S is an n × n symmetric matrix, then the commands

C = 0; r = eigsym(S,"C");

will produce the n × 1 vector r of characteristic roots of S and the n × n matrix C of orthonormal characteristic vectors of S.


(2) If T is an n × n matrix and r is an n × 1 vector, then the command

T = diagrv(T,r);

will replace the diagonal elements of T by the elements of r.


19 Classical Normal Regression 


19.1. Classical Normal Regression Model 


We now strengthen the CR model by assuming that the random vector y is multivariate normally distributed. What results is the classical normal regression, or CNR, model, which consists of the assumptions


(19.1) $y \sim N(X\beta, \sigma^2I)$,
(19.2) X nonstochastic,
(19.3) rank(X) = k.


Recalling the properties of multinormality, the interpretation is that the random variables $y_1, \ldots, y_n$ are independent, with $y_i \sim N(\mu_i, \sigma^2)$, where $\mu_i = x_i'\beta$. (Caution: $x_i'$ denotes the ith row of X, not the transpose of the ith column of X.) So the $y_i$'s are independent normal variables that differ in their means, but have the same variance. With respect to the underlying population, we have in mind a joint probability distribution for the random vector $(y, x_2, \ldots, x_k)'$ in which the conditional distribution of y given the x's is

$$y \mid x_2, \ldots, x_k \sim N(\beta_1 + \beta_2x_2 + \cdots + \beta_kx_k, \sigma^2).$$

The normality refers to the conditional distribution of y given the x's. No normality assumption for the x's is being made, although it is true that if the joint distribution of $(y, x_2, \ldots, x_k)'$ is multinormal, then the conditional distribution above will automatically hold. With respect to the sampling, we continue to rely on the classical, stratified-on-x, scheme set out in Section 16.1.


19.2. Maximum Likelihood Estimation


For estimation in the CNR model, we might consider the maximum likelihood approach. In our initial application of this approach (Section 12.4), the data were randomly sampled from a single population. In the CNR model that feature is lacking, but ML estimation is still possible, because the pdf for the vector y is fully specified up to the parameters $\beta$ and $\sigma^2$.

Recall that if an n × 1 random vector y is distributed $N(\mu, \Sigma)$, then its pdf is

$$f(y) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\exp(-w/2),$$

where

$$w = \varepsilon'\Sigma^{-1}\varepsilon, \qquad \varepsilon = y - \mu, \qquad |\Sigma| = \det(\Sigma).$$

In the CNR model, y is distributed $N(\mu, \Sigma)$ with $\mu = X\beta$ and $\Sigma = \sigma^2I$. So

$$\Sigma^{-1} = (1/\sigma^2)I, \qquad |\Sigma| = (\sigma^2)^n, \qquad \varepsilon = y - X\beta,$$

and the pdf simplifies to

$$f(y) = (2\pi)^{-n/2}(\sigma^2)^{-n/2}\exp[-\varepsilon'\varepsilon/(2\sigma^2)].$$

As a pdf this is viewed as a function whose argument is y, with (the true values of) $\beta$ and $\sigma^2$, and the observed nonrandom X, as givens. But we can also read f(y) as a function $\mathcal{L}(\beta, \sigma^2)$ whose arguments are $\beta$ and $\sigma^2$, with y and X as givens. Doing so, we have the likelihood function for our sample. The ML estimates of $\beta$ and $\sigma^2$ are the values that maximize $\mathcal{L}$, or equivalently maximize its logarithm,

$$L = L(\beta, \sigma^2) = \log\mathcal{L} = -(n/2)\log(2\pi) - (n/2)\log(\sigma^2) - (1/2)\varepsilon'\varepsilon/\sigma^2.$$

With $\varepsilon = y - X\beta$, it is immediate that L is maximized with respect to $\beta$ by minimizing $\varepsilon'\varepsilon$ with respect to $\beta$. But that is just the least-squares criterion, so in the CNR model, the ML estimator of $\beta$ is identical to the LS estimator b.

Inserting the solution value for $\beta$ makes $\varepsilon'\varepsilon = e'e$, with $e = y - Xb$, which leaves the "concentrated log-likelihood function,"

$$L^*(\sigma^2) = L(b, \sigma^2) = -(n/2)\log(2\pi) - (n/2)\log(\sigma^2) - (1/2)e'e/\sigma^2,$$

to be maximized with respect to $\sigma^2$. The first derivative is

$$\partial L^*/\partial\sigma^2 = -(n/2)/\sigma^2 + (1/2)e'e/\sigma^4.$$

Equating $\partial L^*/\partial\sigma^2$ to zero and solving gives the ML estimator of $\sigma^2$ as $e'e/n$, which differs only slightly from our previous estimator $\hat\sigma^2$. So under the CNR model, ML estimation essentially coincides with LS estimation.
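
The maximization of the concentrated log-likelihood can be illustrated numerically. The Python sketch below assumes a hypothetical sample of six observations and a regression on a constant and one regressor (k = 2); the data values and function name are made up for illustration.

```python
import math

# Hypothetical sample for a regression of y on a constant and one regressor.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n, k = len(y), 2

# LS (and hence ML) coefficient estimates for the simple regression.
xbar, ybar = sum(x) / n, sum(y) / n
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
intercept = ybar - slope * xbar
sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))   # e'e

def concentrated_loglik(s2):
    """L*(sigma^2), with beta already set to the LS solution b."""
    return -(n / 2) * math.log(2 * math.pi) - (n / 2) * math.log(s2) - 0.5 * sse / s2

s2_ml = sse / n              # ML estimator of sigma^2
s2_unbiased = sse / (n - k)  # the earlier estimator, which divides by n - k

# First derivative of L* evaluated at the ML estimate; it should vanish.
score_at_ml = -(n / 2) / s2_ml + 0.5 * sse / s2_ml ** 2
print(s2_ml, s2_unbiased, score_at_ml)
```

The derivative is zero (up to rounding) at e'e/n, and the concentrated log-likelihood is lower at any other value of σ², including the unbiased estimate e'e/(n − k).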

Further, with the pdf specified up to parameters, one may evaluate the Cramér-Rao lower bound for the variance of unbiased estimators. Doing so would show that the LS coefficient vector b is the minimum variance unbiased estimator of $\beta$ in the CNR model: see Judge et al. (1988, pp. 227-299) or Amemiya (1985, pp. 17-19).


19.3. Sampling Distributions 


From a practical point of view, the relevant implications of the CNR model are those that refer to the sampling distributions of the LS statistics b and e'e. We defer discussion of e'e until Section 21.1.

The key distribution result in the CNR model is that the random vector b is multinormally distributed:


D1. $b \sim N(\beta, \sigma^2Q^{-1})$.

Proof. Recall that b = Ay, where $A = Q^{-1}X'$ is a constant k × n matrix. Multiplication by a nonsingular matrix preserves rank, so rank(A) = rank(X') = rank(X) = k. So b is a full-row-rank linear function of the multinormal vector y, and hence b is multinormal by P5 (Section 18.2). ■


Any full-row-rank linear function of the multinormal vector b will also be multinormal. Thus:


D2. Let t = Hb and $\theta = H\beta$, where the p × k matrix H is nonrandom with rank(H) = p. Then $t \sim N(\theta, \sigma^2D^{-1})$, where $D = (HQ^{-1}H')^{-1}$.


As a special case we have:


D3. Let $b_j$ be the jth element of b. Then $b_j \sim N(\beta_j, \sigma^2q^{jj})$.

Proof. Take H = h', where h is the k × 1 vector with a 1 in the jth slot and 0's elsewhere. Then $H\beta = \beta_j$, $Hb = b_j$, $HQ^{-1}H' = q^{jj}$. ■


The result D2 also subsumes other linear functions of the elements of b, such as $\hat\mu_i = x_i'b$. Indeed it subsumes D1: just take H = I.

Now proceed to quadratic functions of b. Recall from Q1 (Section 18.4) that if the n × 1 random vector y is distributed $N(\mu, \Sigma)$, then the random variable $w = (y - \mu)'\Sigma^{-1}(y - \mu)$ is distributed $\chi^2(n)$. Applying this to D2 and D3 yields


D4. $w = (t - \theta)'D(t - \theta)/\sigma^2 \sim \chi^2(p)$,


D5. $w = (b_j - \beta_j)^2/(\sigma^2q^{jj}) \sim \chi^2(1)$.
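
The reduction behind Q1, on which D4 and D5 rest, can be traced numerically. The Python sketch below assumes a hypothetical 2 × 2 variance matrix, builds the matrix H = CΛ*C' from the proof of Q1 using the closed-form characteristic roots of a symmetric 2 × 2 matrix, and confirms that the quadratic form equals a sum of squares of the transformed vector. The numbers are illustrative only.

```python
import math

# Hypothetical 2 x 2 variance matrix Sigma = [[a, b], [b, c]], positive definite, b != 0.
a, b, c = 2.0, 1.0, 2.0
Sigma = [[a, b], [b, c]]

# Characteristic roots of a symmetric 2 x 2 matrix in closed form.
mid, half = (a + c) / 2, math.hypot((a - c) / 2, b)
roots = [mid + half, mid - half]

# Orthonormal characteristic vectors: (b, root - a), normalized to unit length.
C_cols = []
for lam in roots:
    v = (b, lam - a)
    norm = math.hypot(v[0], v[1])
    C_cols.append((v[0] / norm, v[1] / norm))

# H = C Lambda* C', where Lambda* holds the reciprocal square roots of the roots.
H = [[sum(C_cols[k][i] * C_cols[k][j] / math.sqrt(roots[k]) for k in range(2))
      for j in range(2)] for i in range(2)]

def matmul(p, q):
    """Product of two 2 x 2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

HSH = matmul(matmul(H, Sigma), H)   # H is symmetric, so H' = H; result should be I

# Quadratic form for an arbitrary deviation vector eps = y - mu.
eps = (1.0, -0.5)
z = [H[i][0] * eps[0] + H[i][1] * eps[1] for i in range(2)]
w_from_z = z[0] ** 2 + z[1] ** 2
w_direct = (c * eps[0] ** 2 - 2 * b * eps[0] * eps[1] + a * eps[1] ** 2) / (a * c - b * b)
print(HSH, w_from_z, w_direct)
```

Since HΣH' = I, the transformed vector z = Hε has identity variance matrix, and w computed as z'z agrees with ε'Σ⁻¹ε computed directly.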


19.4. Confidence Intervals 


We use the distribution results to construct exact confidence intervals and regions, supposing that $\sigma^2$ is known. The more practical results, those that are operational when $\sigma^2$ is unknown, are deferred until Chapter 21.

Rewrite D3 as


D3A. $z = (b_j - \beta_j)/\sigma_{b_j} \sim N(0, 1)$,


where $\sigma_{b_j}^2 = \sigma^2q^{jj}$. Then by the logic of Section 11.5, $b_j \pm 1.96\sigma_{b_j}$ is a 95% confidence interval for the unknown parameter $\beta_j$. Intervals for different confidence levels can be constructed: for example, in place of 1.96, use 1.645 to get a 90% confidence interval, or 2.576 to get a 99% confidence interval. The higher the confidence level requested, the wider the interval.

For a given confidence level, say 95%, the interval will be wide if $\sigma_{b_j}$ is large, that is, if $\sigma^2$ is large and/or $q^{jj}$ is large. Focus on the latter component, $q^{jj} = 1/q_{jj}^*$, where

$$q_{jj}^* = x_j^{*\prime}x_j^* = \sum_i x_{ij}^{*2}$$

is the sum of squared residuals in the auxiliary regression of $x_j$ on all the other x's: see the Submatrix of Inverse Theorem (Section 17.5). The confidence interval will be wide (ceteris paribus) if that sum of squared residuals is small. Confine attention to the leading case where $\beta_j$ is a slope coefficient in a regression that includes a constant. Then the auxiliary regression will also include a constant, so its coefficient of determination, $R_j^2$ say, is well defined. Hence

$$q_{jj}^* = (1 - R_j^2)\sum_i (x_{ij} - \bar{x}_j)^2.$$

So the interval will be wide (ceteris paribus) if $\sum_i (x_{ij} - \bar{x}_j)^2$ is small, and/or if $R_j^2$ is large. The latter says that collinearity of $x_j$ with the other x's tends to produce wide confidence intervals, a topic to which we return in Chapter 23.

To summarize, for an individual slope coefficient $\beta_j$, the confidence interval for a given level will be wide (that is, the estimate of $\beta_j$ will be imprecise) if the population conditional variance of y is large, the variation of $x_j$ about its mean is small, and/or the auxiliary $R_j^2$ is large.

The procedure developed here also applies to constructing a confidence interval for a single linear combination of the elements of $\beta$. Let $\theta = h'\beta$ and $t = h'b$, where h is a nonrandom k × 1 vector. With p = 1, we can rewrite D2 as $z = (t - \theta)/\sigma_t \sim N(0, 1)$, where $\sigma_t^2 = \sigma^2h'Q^{-1}h = h'V(b)h$. A 95% confidence interval for the scalar parameter $\theta$ is given by $t \pm 1.96\sigma_t$.
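
The interval calculations of this section can be sketched in a few lines of Python. The inputs below (the matrix Q = X'X, the known σ², and the estimates b) are hypothetical, and the function name `interval` is illustrative; the case k = 2 is used so that $Q^{-1}$ can be written out directly.

```python
import math

# Hypothetical inputs: Q = X'X, known sigma^2, and LS estimates b.
Q = [[5.0, 2.0], [2.0, 3.0]]
sigma2 = 1.5
b = [1.0, 0.5]

det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]
Q_inv = [[Q[1][1] / det, -Q[0][1] / det],
         [-Q[1][0] / det, Q[0][0] / det]]

def interval(h, crit=1.96):
    """Interval t +/- crit * sigma_t for theta = h'beta, with sigma_t^2 = sigma^2 h'Q^{-1}h."""
    t = sum(hi * bi for hi, bi in zip(h, b))
    var_t = sigma2 * sum(h[i] * Q_inv[i][j] * h[j] for i in range(2) for j in range(2))
    half = crit * math.sqrt(var_t)
    return t - half, t + half

ci_b1 = interval([1.0, 0.0])             # h picks off beta_1, as in D3A
ci_b1_99 = interval([1.0, 0.0], 2.576)   # 99% interval for beta_1
ci_sum = interval([1.0, 1.0])            # theta = beta_1 + beta_2
print(ci_b1, ci_b1_99, ci_sum)
```

Taking h to be a unit vector reproduces the single-coefficient interval of D3A, and the 99% interval is wider than the 95% interval, as the text notes.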


19.5. Confidence Regions 
Preliminaries 


Suppose that we are concerned with the parameter pair $(\theta_1, \theta_2)$. From our sample, we have constructed the two 95% confidence intervals

$$t_1 \pm 1.96\sigma_{t_1}, \qquad t_2 \pm 1.96\sigma_{t_2}.$$

The probability that the first interval covers the true $\theta_1$ is 0.95, and the probability that the second interval covers the true $\theta_2$ is 0.95. Can we combine these two intervals to get a 95% confidence region for the pair $(\theta_1, \theta_2)$? The intersection of the two intervals in the $(\theta_1, \theta_2)$ plane is a rectangular region, a box. What is the probability that this random box covers the true parameter point $(\theta_1, \theta_2)$? Let $A_1$ be the event that the true $\theta_1$ lies in the first interval, and let $A_2$ be the event that the true $\theta_2$ lies in the second interval. Then $A = A_1 \cap A_2$ is the event that the true point $(\theta_1, \theta_2)$ lies in the box. While $\Pr(A_1) = 0.95 = \Pr(A_2)$, it does not follow that Pr(A) = 0.95. So the box is not a 95% joint confidence region.

Indeed if $A_1$ and $A_2$ are independent events (that is, if $t_1$ and $t_2$ are independent random variables), then $\Pr(A) = \Pr(A_1)\Pr(A_2) = (0.95)^2 = 0.9025$, and the box will be a 90.25% joint confidence region. In general, $C(t_1, t_2)$ is nonzero, so in general $A_1$ and $A_2$ are not independent events, so Pr(A) ≠ 0.9025. It is possible to calculate Pr(A) from the BVN distribution of $t_1$ and $t_2$, and thus to ascertain the exact confidence level of the box. By the same token it is possible to get an exact 95% box by using an appropriate critical value in place of 1.96. Let $A_1^*$ be the event that the true $\theta_1$ lies in the interval $t_1 \pm c^*\sigma_{t_1}$, let $A_2^*$ be the event that the true $\theta_2$ lies in the interval $t_2 \pm c^*\sigma_{t_2}$, and let $A^* = A_1^* \cap A_2^*$. Then from the BVN distribution of $t_1$ and $t_2$, one can find $c^*$ such that $\Pr(A^*) = 0.95$, and thus obtain a box whose exact confidence level is 95%.

But an alternative approach to constructing joint confidence regions is simpler, and conventional as well.
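
The coverage probability of the box can be approximated by simulation rather than by bivariate normal integration. The Python sketch below assumes standardized estimators with unit variances and correlation r; the seed, number of draws, and the two values of r are arbitrary choices made for illustration.

```python
import math
import random

def box_coverage(r, draws=20000, crit=1.96, seed=7):
    """Simulated probability that both |z1| <= crit and |z2| <= crit when corr(z1, z2) = r."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1 = e1
        z2 = r * e1 + math.sqrt(1.0 - r * r) * e2   # unit variance, correlation r with z1
        hits += abs(z1) <= crit and abs(z2) <= crit
    return hits / draws

cov_indep = box_coverage(0.0)   # near 0.9025 when the estimators are independent
cov_corr = box_coverage(0.9)    # differs from 0.9025 when they are correlated
print(cov_indep, cov_corr)
```

With r = 0 the simulated coverage is close to 0.9025, while with strong correlation it moves away from that value, so in neither case is the box a 95% region.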


Joint Confidence Region 


Suppose that we are concerned with the p × 1 parameter vector $\theta = H\beta$. Given a sample value of t, and knowledge of $\sigma^2$, we rely on D4 to propose

$$(\theta - t)'D(\theta - t)/\sigma^2 \le c_p \tag{19.4}$$

as the 95% confidence region for the unknown parameter vector $\theta$. Here $c_p$ is the 5% critical value from the $\chi^2(p)$ table, that is, $G_p(c_p) = 0.95$, where $G_p(\cdot)$ is the cdf of the $\chi^2(p)$ distribution. For example, from Table A.2, $c_1 = 3.84$, $c_2 = 5.99$. The region consists of all p × 1 vectors $\theta$ that satisfy the inequality. Centered at the point t, the region is an ellipsoid, because the matrix $D/\sigma^2$ is positive definite. Observe that here $\theta$ denotes the argument of a function, not necessarily the true parameter point.

The rationale for the proposal is clear. For arbitrary $\theta$, the inequality (19.4) defines a random ellipsoid, with center that varies from sample to sample as t varies. For the true $\theta$, the left-hand side of the inequality is the random variable

$$w = (t - \theta)'D(t - \theta)/\sigma^2,$$

which, according to D4, is distributed $\chi^2(p)$. The true parameter point $\theta$ will lie in or on the random ellipsoid iff $w \le c_p$. Let A be the event that $w \le c_p$. We conclude that Pr(A) = 0.95. Thus the probability that A occurs, that is, that the random ellipsoid covers the true $\theta$, is 0.95. That justifies saying that inequality (19.4) provides a 95% confidence region for the parameter vector $\theta$.

Regions for different confidence levels are constructed by using appropriate critical values from the $\chi^2(p)$ table. The higher the confidence level requested, the larger the critical value, and hence the larger the resulting ellipsoid.

If we apply D4 for a single parameter $\theta$, we get the 95% confidence region $w \le c_1$, where $w = (t - \theta)^2/\sigma_t^2$. Now, $w \le c_1$ defines an interval on the real line, centered at t. But $w = z^2$, where $z = (t - \theta)/\sigma_t$, and $c_1 = 3.84 = (1.96)^2$. So $w \le c_1$ is equivalent to $|z| \le \sqrt{c_1} = 1.96$. That is, the region $w \le c_1$ coincides with the interval $t \pm 1.96\sigma_t$ of Section 19.4.

It is worth noting that the w of D4 can be written as

$$w = (t - \theta)'[V(t)]^{-1}(t - \theta),$$

because $(D/\sigma^2) = (\sigma^2D^{-1})^{-1} = [V(t)]^{-1}$. So w is a natural generalization of the scalar $z^2 = (t - \theta)^2/\sigma_t^2$.
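
Checking whether a candidate point belongs to the joint region amounts to evaluating this quadratic form. The following Python sketch does so for p = 2, assuming hypothetical values for t and V(t); the function name is illustrative, and the critical value is the one quoted from Table A.2.

```python
# 5% critical value of chi-square(2), as quoted in the text.
C2 = 5.99

def joint_region_check(theta, t, V_t):
    """Return (inside, w), where w = (theta - t)' [V(t)]^{-1} (theta - t) and inside is w <= c_2."""
    d1, d2 = theta[0] - t[0], theta[1] - t[1]
    v11, v12, v22 = V_t[0][0], V_t[0][1], V_t[1][1]
    det = v11 * v22 - v12 * v12
    # Quadratic form written out with the explicit inverse of a 2 x 2 matrix.
    w = (v22 * d1 * d1 - 2.0 * v12 * d1 * d2 + v11 * d2 * d2) / det
    return w <= C2, w

# Hypothetical estimates and variance matrix V(t) = sigma^2 D^{-1}.
t = (1.0, 2.0)
V_t = [[1.0, 0.6], [0.6, 1.0]]

in_a, w_a = joint_region_check((2.5, 3.5), t, V_t)   # deviations (1.5, 1.5)
in_b, w_b = joint_region_check((2.5, 0.5), t, V_t)   # deviations (1.5, -1.5)
print(in_a, w_a, in_b, w_b)
```

Both candidate points lie within 1.96 standard deviations of t coordinate by coordinate, yet only the first satisfies the joint inequality: with positively correlated estimators, deviations of opposite sign are much less plausible than deviations of the same sign.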


19.6. Shape of the Joint Confidence Region 


To study the joint confidence region and its relation to univariate confidence intervals, it is convenient to take the case p = 2, and further specialize as follows. Suppose that

$$\begin{pmatrix} t_1 \\ t_2 \end{pmatrix} \sim N\left[\begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix}, \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}\right],$$

where r lies between −1 and 1 to ensure positive definiteness.
Here $z_1 = t_1 - \theta_1$ and $z_2 = t_2 - \theta_2$ are each distributed N(0, 1), while $w_1 = z_1^2$ and $w_2 = z_2^2$ are each distributed $\chi^2(1)$. The relevant 95% critical values are $c_1 = 3.84$ and $c = \sqrt{c_1} = 1.96$. The box, centered at the sample point $(t_1, t_2)$, which is defined by intersecting the intervals $|z_1| \le c$ and $|z_2| \le c$, is not a 95% confidence region. However, consider the variable

$$w = (\theta - t)'D(\theta - t)/\sigma^2 = z'(D/\sigma^2)z,$$

where $z = (z_1, z_2)'$. With

$$V(t) = \sigma^2D^{-1} = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix},$$

we have

$$D/\sigma^2 = (\sigma^2D^{-1})^{-1} = (1 - r^2)^{-1}\begin{pmatrix} 1 & -r \\ -r & 1 \end{pmatrix},$$

so $w = (z_1^2 + z_2^2 - 2rz_1z_2)/(1 - r^2)$. For the true $\theta$, the theory says that $w \sim \chi^2(2)$, so a 95% confidence region for $\theta$ is $w \le c_2$, where $c_2 = 5.99$, the 5% critical value from the $\chi^2(2)$ table. We can write the region as

$$z_1^2 + z_2^2 - 2rz_1z_2 \le c_2(1 - r^2).$$

In the $\theta_1, \theta_2$ plane, this is an ellipse centered at $(t_1, t_2)$. For convenience, let us translate the axes so the origin is now located at the point $(t_1, t_2)$. The ellipse is centered at the origin in the $z_1, z_2$ plane. We illustrate the possibilities with two figures.

Figure 19.1 refers to the case r = 0, which arises when the estimators are uncorrelated. The ellipse is just a circle,

$$z_1^2 + z_2^2 \le c_2,$$

centered at the origin with radius $\sqrt{c_2} = \sqrt{5.99} = 2.45$. The box is a square centered at the origin, with each half-side equal to $\sqrt{c_1}$. Look along a coordinate axis: because $\sqrt{c_2} > \sqrt{c_1}$, it is evident that there are points in the circle that are not in the square. Look along the 45° line emanating from the center, that is, along the ray $z_1 = z_2$: the circle passes through the point $z_1 = z_2 = \sqrt{c_2/2}$, while the northeast corner of the square is located at the point $z_1 = z_2 = \sqrt{c_1}$. Since $\sqrt{c_2/2} < \sqrt{c_1}$, it is evident that there are points in the square that are not in the circle. This exhibits the distinction between intersecting two univariate 95% confidence intervals and constructing a 95% joint confidence region.

Figure 19.1 Confidence ellipse and box: r = 0.


Figure 19.2 refers to the case r = 0.6. Here we have a proper ellipse,

$$z_1^2 + z_2^2 - 2rz_1z_2 \le c_2(1 - r^2),$$

centered at the origin. The major axis of the ellipse runs along the 45° line, so its vertices can be located by setting $z_1 = z_2 = z$, say, and solving the equation $z^2 + z^2 - 2rz^2 = c_2(1 - r^2)$ to get $z^2 = c_2(1 + r)/2$. So the northeast vertex is located at the point $z_1 = z_2 = \sqrt{c_2(1 + r)/2}$. The minor axis of the ellipse runs along the −45° line, so its vertices can be located by setting $z_1 = z$ and $z_2 = -z$, say, and solving the equation $z^2 + z^2 - 2rz(-z) = c_2(1 - r^2)$. The solution is $z^2 = c_2(1 - r)/2$, so the southeast vertex is located at the point $z_1 = \sqrt{c_2(1 - r)/2}$, $z_2 = -\sqrt{c_2(1 - r)/2}$. Specifically, with r = 0.6, the northeast vertex is located at (2.19, 2.19) and the southeast vertex is located at (1.09, −1.09). Here again we see points in the ellipse that are not in the square, and points in the square that are not in the ellipse. As compared with the circle that prevailed when r = 0, which intersected the 45° line at the point $[\sqrt{c_2/2}, \sqrt{c_2/2}] = (1.73, 1.73)$, the ellipse has been stretched out in one direction and pulled in somewhat in the other.
Figure 19.2 Confidence ellipse and box: r = 0.6.
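
The vertex calculations for the two figures can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function name `ellipse_vertices` is hypothetical, and the critical values 3.84 and 5.99 are the rounded table values quoted in this chapter.

```python
import math

C1 = 3.84  # 5% critical value of chi-square(1), rounded table value
C2 = 5.99  # 5% critical value of chi-square(2), rounded table value

def ellipse_vertices(r, c2=C2):
    """Vertices of z1^2 + z2^2 - 2 r z1 z2 <= c2 (1 - r^2) on the +45 and -45 degree lines."""
    ne = math.sqrt(c2 * (1 + r) / 2)   # on the line z1 = z2
    se = math.sqrt(c2 * (1 - r) / 2)   # on the line z1 = -z2
    return (ne, ne), (se, -se)

half_side = math.sqrt(C1)  # half-side of the box built from two univariate intervals

for r in (0.0, 0.6):
    ne, se = ellipse_vertices(r)
    print(f"r = {r}: NE vertex ({ne[0]:.2f}, {ne[1]:.2f}), "
          f"SE vertex ({se[0]:.2f}, {se[1]:.2f}), box half-side {half_side:.2f}")
```

With $r = 0$ both vertices lie at distance $\sqrt{c_2/2}$ along their axes, which is the circle case; with $r = 0.6$ the northeast vertex lies outside the box corner while the southeast vertex lies well inside it.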


Exercises 

19.1 The CNR model applies with $k = 4$, $X'X = I$, $\sigma^2 = 2$, and
$\beta = 0$. Let $t = b'b$. Find the number $c$ such that $\Pr(t > c) = 0.10$.

19.2 The CNR model applies to $E(y) = X\beta$, with $\sigma^2 = 7$ and


me {4 1 
ХХ = (; À : 
A sample gives these LS estimates: $b_1 = 3$, $b_2 = 2$. Determine whether
the point $\beta_1 = 2$, $\beta_2 = -2$ lies within the 95% confidence region for $\beta$.


19.3 The CNR model applies to $E(y) = \beta_1x_1 + \beta_2x_2$, with $\sigma^2 = 2$ and


wal? 2 
xx-(; 2). 
Your sample has $b_1 = 3$, $b_2 = 2$.


(a) Construct a 95% confidence interval for $\theta = \beta_1 + \beta_2$.
(b) Construct a 90% confidence region for the pair $(\beta_1, \beta_2)$.


20  CNR Model: Hypothesis Testing 


20.1. Introduction 


We proceed to another type of statistical inference, the testing of
hypotheses about the population parameter vector $\beta$. We suppose that
the CNR model holds, so that


$y \sim N(X\beta, \sigma^2 I)$, $X$ ($n \times k$) nonstochastic, rank$(X) = k$.


We continue under the assumption that $\sigma^2$ is known, so that the
distribution results of Chapter 19 are applicable.


20.2. Test on a Single Parameter 


Suppose that we have a hypothesis about the $j$th regression coefficient,
specifically the null hypothesis that $\beta_j = \beta_j^\circ$, where $\beta_j^\circ$ is a specific number.
We propose the following 5%-significance-level test against the alternative
hypothesis that $\beta_j \ne \beta_j^\circ$. With a sample in hand, accept the null hypothesis
if $\beta_j^\circ$ lies within the 95% confidence interval for $\beta_j$, namely $b_j \pm 1.96\sigma_{b_j}$;
reject the null hypothesis otherwise. Equivalently, calculate the test statistic


$$z_j^\circ = (b_j - \beta_j^\circ)/\sigma_{b_j}$$

and compare its absolute value with the critical value 1.96.


If $|z_j^\circ| > 1.96$, then reject the null hypothesis $\beta_j = \beta_j^\circ$.

If $|z_j^\circ| \le 1.96$, then accept the null hypothesis $\beta_j = \beta_j^\circ$.

The rationale is as follows. Think of $b_j$, and hence $z_j^\circ$, as a random
variable rather than the value obtained in a particular sample. Let $A$ be
the event $\{|z_j^\circ| > 1.96\}$. The probability that $A$ occurs depends on what
the true value of $\beta_j$ is. If the true value is $\beta_j^\circ$, so that the null hypothesis
is true, then the random variable $z_j^\circ$ is identical to the random variable
$z_j$ defined in


D3A. $z_j = (b_j - \beta_j)/\sigma_{b_j} \sim N(0, 1)$.

Consequently, $\Pr(A \mid \beta_j = \beta_j^\circ) = 0.05$. So the significance level, namely
the probability of rejecting the null hypothesis when it is true, is 5%.

Suppose a sample has $|z_j^\circ| > 1.96$. If the null is true, then a low-
probability event has occurred. The probability of the event is so low
that its occurrence is taken to be evidence against the null; so the
decision is to reject the null. Heuristically, the point estimate $b_j$ is so far
from the hypothesized parameter value $\beta_j^\circ$ that it is implausible that $b_j$
has in fact been drawn from a distribution with expected value $\beta_j^\circ$.
However, finding a sample with $|z_j^\circ| \le 1.96$ is not surprising when the
null is true, so then the decision is to accept the null.

When $|z_j^\circ| > 1.96$, one says that $b_j$ is significantly different from $\beta_j^\circ$ at the
5% level; when $|z_j^\circ| \le 1.96$, one says that $b_j$ is not significantly different
from $\beta_j^\circ$ at the 5% level.
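
The decision rule can be written out as a short Python sketch. This is an illustration only, not part of the original text: the function name `z_test` and the numbers in the two calls are hypothetical, and $\sigma_{b_j}$ is taken as known, as in this chapter.

```python
from statistics import NormalDist

def z_test(b_j, beta_null, sigma_b, level=0.05):
    """Two-sided test of beta_j = beta_null when sigma_b, the s.d. of b_j, is known."""
    crit = NormalDist().inv_cdf(1 - level / 2)   # about 1.96 when level = 0.05
    z = (b_j - beta_null) / sigma_b              # the test statistic z_j
    interval = (b_j - crit * sigma_b, b_j + crit * sigma_b)
    reject = abs(z) > crit                       # same as: beta_null outside the interval
    return z, interval, reject

# Hypothetical numbers, chosen only for illustration.
print(z_test(b_j=3.0, beta_null=2.0, sigma_b=0.4))   # |z| = 2.5 exceeds 1.96: reject
print(z_test(b_j=3.0, beta_null=2.5, sigma_b=0.4))   # |z| = 1.25: accept
```

The same sample rejects one null value and accepts another, which is the point made about acceptance in the lessons that follow.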

Several lessons are immediate: 

* Rejection of the null is not proof that the null is false. After all,
there is a nonzero probability of rejecting the null if it is true:
$\Pr(|z_j^\circ| > 1.96 \mid \beta_j = \beta_j^\circ) = 0.05$. Loosely speaking, when the null is true, in 5% of
the samples drawn from the population, the decision will be “reject the
null.”

* Acceptance of the null is not proof that the null is true. After all,
different null hypotheses would also have been acceptable. Indeed if
the null had been $\beta_j = \beta_j^{\circ\circ}$, where $\beta_j^{\circ\circ}$ is any other point that happens
to lie in the confidence interval $b_j \pm 1.96\sigma_{b_j}$, it too would have been
accepted as a null hypothesis.

* If $\sigma_{b_j}$ is large, then the 95% confidence interval is wide, and widely
diverse null hypotheses about $\beta_j$ are all acceptable at the 5% level. In
that situation, the sample contains little information about the true value
of $\beta_j$. The LS estimator $b_j$ may well be the best estimator, but it need
not be a precise estimator.

The test procedure adapts to handle a null hypothesis about $\beta_j$ at
different significance levels. Further, to test a null hypothesis about a
single linear combination of the elements of $\beta$: accept iff the null $\theta^\circ$ lies
in the confidence interval for $\theta$, equivalently, iff the test statistic
$z^\circ = (t - \theta^\circ)/\sigma_t$ is in absolute value less than or equal to the critical value.

It is good practice to use correct wording in reporting the outcome
of a test: it is the estimate $b_j$, not the parameter $\beta_j$, whose significance
is being assessed; and the test is being conducted at a 5% significance
level, say, not at a 95% confidence level or at a 95% significance level.


20.3. Test on a Set of Parameters 


Next suppose we have a joint null hypothesis about $\beta$, specifically about $p$
linear functions of $\beta$. Let $\theta = H\beta$, where the $p \times k$ nonrandom matrix
$H$ has rank $p$. Our null hypothesis is


$$\theta = \theta^\circ,$$


where $\theta^\circ$ is a specific numerical $p \times 1$ vector. By appropriate choice of
$H$, this subsumes the situations where the hypothesis concerns the full
vector $\beta$, or a $k_2 \times 1$ subvector $\beta_2$, or a single element $\beta_j$.

We propose the following 5%-significance-level test against the
alternative hypothesis that $\theta \ne \theta^\circ$. With a sample in hand, accept the null
hypothesis if $\theta^\circ$ lies within the 95% confidence region for $\theta$ given by


$$w = (\theta - t)'D(\theta - t)/\sigma^2 \le c_p;$$


reject the null hypothesis otherwise. Here $t = Hb$, while $c_p$ is the 5%
critical value from the $\chi^2(p)$ table; that is, $G_p(c_p) = 0.95$ where $G_p(\cdot)$ is
the cdf of the $\chi^2(p)$ distribution. Equivalently, calculate the test statistic


$$w^\circ = (t - \theta^\circ)'D(t - \theta^\circ)/\sigma^2.$$


If $w^\circ > c_p$, then reject the null hypothesis $\theta = \theta^\circ$.
If $w^\circ \le c_p$, then accept the null hypothesis $\theta = \theta^\circ$.


The rationale is as follows. Think of $w^\circ$ as a random variable rather
than as the value obtained in a particular sample. Let $A = \{w^\circ > c_p\}$.
The probability that $A$ occurs depends on the true value of $\theta$. If the
null hypothesis is true, so that $\theta = \theta^\circ$, then the random variable $w^\circ$ is
identical to the random variable $w$ defined in


D4. $w = (t - \theta)'D(t - \theta)/\sigma^2 \sim \chi^2(p)$.

Because $\Pr(w > c_p) = 1 - G_p(c_p) = 0.05$, we have $\Pr(A \mid \theta = \theta^\circ) = 0.05$,
so the significance level, namely the probability of rejecting the null
hypothesis when it is true, is 5%. Suppose that a sample has $w^\circ > c_p$. If
the null hypothesis is true, then a low-probability event has occurred.
The probability of the event is so low that its occurrence is taken to be
evidence against the null. Heuristically, the point estimate $t$ is so far
from the hypothesized parameter value $\theta^\circ$ that it is implausible that $t$
has in fact been drawn from a distribution with expectation $\theta^\circ$. But
finding a sample with $w^\circ \le c_p$ is not surprising when the null is true.
Because $V(t) = \sigma^2D^{-1}$, we can write


$$w^\circ = (t - \theta^\circ)'[V(t)]^{-1}(t - \theta^\circ),$$


which shows that $w^\circ$ measures the deviation $t - \theta^\circ$ in the same way that
$z_j^\circ$ measures the deviation $b_j - \beta_j^\circ$, that is, relative to the variability of
the estimator.

The discussion of ellipses and boxes in Section 19.6 implies that one
cannot tell the outcome of the test of a joint hypothesis from the
outcomes of univariate tests of its separate components. It is quite possible
to accept each of the separate null hypotheses $\theta_j = \theta_j^\circ$ tested one by one,
while rejecting the joint null hypothesis $\theta_j = \theta_j^\circ$ $(j = 1, \ldots, p)$. There
is no paradox here: the conjunction of two hypotheses may be
untenable, even though either hypothesis by itself is tenable.
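
A small numerical illustration of this point, as a Python sketch that is not part of the original text. It assumes $p = 2$, $\theta^\circ = 0$, and $V(t)$ with unit variances and correlation $r$, so that $w^\circ = (t_1^2 + t_2^2 - 2rt_1t_2)/(1 - r^2)$; the function name and the sample values of $t$ are hypothetical.

```python
import math

C1 = 1.96 ** 2               # 5% critical value of chi-square(1)
C2 = -2.0 * math.log(0.05)   # 5% critical value of chi-square(2): Pr(w > c) = exp(-c/2)

def joint_and_separate(t1, t2, r):
    """Test theta = 0 jointly and componentwise when V(t) = [[1, r], [r, 1]]."""
    w = (t1 ** 2 + t2 ** 2 - 2 * r * t1 * t2) / (1 - r ** 2)   # t' V(t)^{-1} t
    reject_each = (t1 ** 2 > C1, t2 ** 2 > C1)   # univariate tests, one by one
    reject_joint = w > C2
    return w, reject_each, reject_joint

# Each component accepted, joint null rejected:
print(joint_and_separate(1.8, -1.8, r=0.6))
# Each component rejected, joint null accepted:
print(joint_and_separate(2.0, 2.0, r=0.6))
```

The first point lies inside the box but outside the ellipse; the second lies outside the box but inside the ellipse, so disagreement can run in either direction.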


20.4. Power of the Test 


To test (at the 5% significance level) the null hypothesis $\theta = \theta^\circ$ against
the alternative $\theta \ne \theta^\circ$, we have proposed the test statistic


$$w^\circ = (t - \theta^\circ)'D(t - \theta^\circ)/\sigma^2,$$


and the decision rule: reject the null iff $w^\circ > c$, where $c$ is the 5% critical
value in the $\chi^2(p)$ table. Our rationale was that the event $\{w^\circ > c\}$ is rare
when the null hypothesis is true: $\Pr[(w^\circ > c) \mid \theta = \theta^\circ] = 0.05$. Why choose
this particular rejection region, $w^\circ > c$? Taking as the rejection region
any other interval for $w^\circ$ whose probability is 0.05 when the null is true
would also provide a test at the 5% significance level. Indeed, one might
just toss a fair 20-sided die and reject the null iff the “1” turns up; that
would also provide a 5% significance level test of the null $\theta = \theta^\circ$.

To understand our choice, one must consider the power function for
the test, namely the probability of rejecting the null $\theta = \theta^\circ$, as a function
of the true parameter value $\theta$. The “rare if null true” rationale for
rejecting the null when the event $w^\circ > c$ occurs would lose its force if it
turned out that the event $w^\circ > c$ was even rarer when the null is false,
that is, if it turned out that $\Pr[(w^\circ > c) \mid \theta \ne \theta^\circ] < 0.05$. Our claim is
that


$$\Pr[(w^\circ > c) \mid \theta] \ge 0.05,$$


with equality iff $\theta = \theta^\circ$. That is, the power of the test, namely the
probability of rejecting the null, is everywhere greater than the significance
level, except at $\theta = \theta^\circ$. Further, the power is increasing in a sensible
measure of the distance between the hypothesized and true value of $\theta$.
Tests that use a different rejection region may not have those desirable
properties. For example, the power of the 20-sided-die test is
everywhere equal to its significance level.

The argument rests on the distribution of $w^\circ$ as a function of $\theta$. We
first focus on the expectation, showing that $E(w^\circ)$ increases as $\theta$ departs
from $\theta^\circ$. Define the random miss vector,


$$m = t - \theta^\circ,$$

and rewrite the test statistic as

$$w^\circ = m'(D/\sigma^2)m.$$

Now $E(m) = E(t) - \theta^\circ = H\beta - \theta^\circ = \theta - \theta^\circ = \mu$, say, and
$V(m) = V(t) = \sigma^2D^{-1}$. (Caution: Do not confuse $\mu = E(m)$ with $\mu = E(y) = X\beta$.)
Use R5 (Section 15.1, with $D/\sigma^2$ playing the role of $T$) to calculate


$$E(w^\circ) = \mathrm{tr}[(D/\sigma^2)\sigma^2D^{-1}] + \mu'(D/\sigma^2)\mu = \mathrm{tr}(I_p) + \mu'(D/\sigma^2)\mu = p + \mu'(D/\sigma^2)\mu.$$


Because $D/\sigma^2$ is positive definite, we conclude that $E(w^\circ) \ge p$, with
equality iff $\mu = 0$, that is, iff $\theta = \theta^\circ$. The farther the null is from the
truth, that is, the larger the magnitude of the $p \times 1$ vector $\mu$, as
measured by the nonnegative scalar $\mu'(D/\sigma^2)\mu$, the larger is $E(w^\circ)$.
Incidentally, the calculation so far does not rely on normality.
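
The remark about normality can be illustrated by simulation. The sketch below is not part of the original text; it assumes $p = 1$, so that $E(w^\circ) = 1 + (\theta - \theta^\circ)^2/\sigma_t^2$, and it deliberately draws $t$ from a uniform (non-normal) distribution with mean $\theta$ and variance $\sigma_t^2$. The function name, seed, and number of draws are arbitrary choices.

```python
import math
import random

def mean_w(theta, theta_null, sigma_t, draws=100_000, seed=0):
    """Average of w = (t - theta_null)^2 / sigma_t^2 over simulated uniform draws of t."""
    rng = random.Random(seed)
    half_width = math.sqrt(3.0) * sigma_t   # uniform on [-a, a] has variance a^2 / 3
    total = 0.0
    for _ in range(draws):
        t = theta + rng.uniform(-half_width, half_width)   # E(t) = theta, V(t) = sigma_t^2
        total += (t - theta_null) ** 2 / sigma_t ** 2
    return total / draws

for theta in (0.0, 1.0, 2.0):
    predicted = 1 + (theta - 0.0) ** 2 / 1.0 ** 2   # p + mu'(D / sigma^2) mu with p = 1
    print(theta, round(mean_w(theta, 0.0, 1.0), 2), predicted)
```

The simulated averages should sit close to the predicted values, rising as $\theta$ moves away from $\theta^\circ = 0$, even though $t$ is not normal here.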
Finding that $E(w^\circ)$ rises as $\theta$ departs from $\theta^\circ$ makes it plausible that
$\Pr(w^\circ > c)$ also rises as $\theta$ departs from $\theta^\circ$. But to establish the latter
requires examination of the probability distribution of $w^\circ$ as a function
of the parameter $\theta$.


20.5. Noncentral Chi-square Distribution 
The relevant distribution theory starts at the level of Section 18.3: 


Suppose that the $n \times 1$ random vector $z$ is distributed $N(\alpha, I)$. Let
$w = z'z$. Then $w \sim \chi^{2*}(n, \lambda^2)$, where $\lambda^2 = \alpha'\alpha$.


That is, the sum of squares of $n$ independent $N(\alpha_i, 1)$ variables has the
noncentral chi-square distribution with degrees of freedom parameter $n$
and noncentrality parameter $\lambda^2 = \sum_i \alpha_i^2$. The familiar (central) chi-square
distribution is the special case that arises when $\alpha = 0$. Table A.4 gives
a small display of $1 - G_k^*(c_k; \lambda^2)$, where $G_k^*(\cdot\,; \lambda^2)$ denotes the cdf of the
$\chi^{2*}(k, \lambda^2)$ distribution, and $c_k$ is the critical value relevant for testing at
the 5% significance level. That is, $c_k$ is defined by $G_k^*(c_k; 0) = G_k(c_k) = 0.95$;
thus $c_1 = 3.84$, $c_2 = 5.99$, $c_3 = 7.81$. In each column of the table,
we see that the probability of exceeding $c_k$ increases with $\lambda^2$. (Caution:
This table records the complement of the cdf, not a cdf itself as Table
A.2 did.)

Retracing the steps used in the proof of Q1 (Section 18.4), it follows 
immediately that: 


Q4. Suppose that the $n \times 1$ vector $y$ is distributed $N(\mu, \Sigma)$. Let
$w = y'\Sigma^{-1}y$. Then $w \sim \chi^{2*}(n, \lambda^2)$, where $\lambda^2 = \mu'\Sigma^{-1}\mu$.


Now return to the CNR model. As a linear function of $b$, the $p \times 1$
miss vector, $m = t - \theta^\circ$, is distributed $N(\mu, \sigma^2D^{-1})$, with $\mu = \theta - \theta^\circ$.
Applying Q4 to our test statistic $w^\circ = m'(D/\sigma^2)m$ gives a new
distribution result for the CNR model:


$w^\circ \sim \chi^{2*}(p, \lambda^2)$, with $\lambda^2 = (\theta - \theta^\circ)'D(\theta - \theta^\circ)/\sigma^2$.


So the power of our test is given by

$\Pr[(w^\circ > c_p) \mid \theta] = 1 - G_p^*(c_p; \lambda^2)$, with $\lambda^2 = (\theta - \theta^\circ)'D(\theta - \theta^\circ)/\sigma^2$.


Observe that the probability distribution of the test statistic $w^\circ$ depends
on $\theta$ only through the scalar $\lambda^2$. For a given significance level, that is
for a given $c$, the power in $\theta$-space will be constant along ellipsoids
centered at $\theta^\circ$. As $\theta$ departs from $\theta^\circ$, that is, as we move farther out
along a ray through $\theta - \theta^\circ$, the scalar $\lambda^2$ increases, and so the power
increases. Observe also that the power depends on the direction, not
merely upon the magnitude, of $\theta - \theta^\circ$.
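
The dependence of power on direction can be seen in a small Python sketch, which is not part of the original text. It assumes $p = 2$ and $V(t)$ with unit variances and correlation $r$, and it evaluates the noncentral chi-square probability through the standard representation of that distribution as a Poisson mixture of central chi-squares (a swapped-in computational device; the text itself relies on Table A.4). All function names are hypothetical.

```python
import math

def chi2_sf_even(c, df):
    """Pr(central chi-square(df) > c) for even df: exp(-c/2) * sum_{i < df/2} (c/2)^i / i!."""
    half = c / 2.0
    term, total = 1.0, 0.0
    for i in range(df // 2):
        total += term
        term *= half / (i + 1)
    return math.exp(-half) * total

def power_p2(lam_sq, c=-2.0 * math.log(0.05), terms=200):
    """Pr(noncentral chi-square(2, lam_sq) > c), as a Poisson(lam_sq / 2) mixture."""
    weight = math.exp(-lam_sq / 2.0)   # Poisson probability of j = 0
    total = 0.0
    for j in range(terms):
        total += weight * chi2_sf_even(c, 2 + 2 * j)
        weight *= (lam_sq / 2.0) / (j + 1)
    return total

def lam_sq(d1, d2, r):
    """Noncentrality (theta - theta0)' V(t)^{-1} (theta - theta0) with V(t) = [[1, r], [r, 1]]."""
    return (d1 ** 2 + d2 ** 2 - 2 * r * d1 * d2) / (1 - r ** 2)

# Two deviations of equal Euclidean length but different direction, with r = 0.5.
for d in ((1.5, 1.5), (1.5, -1.5)):
    l2 = lam_sq(d[0], d[1], 0.5)
    print(d, l2, round(power_p2(l2), 3))
print("size at the null:", round(power_p2(0.0), 3))
```

The two deviations have the same length, yet their noncentrality parameters are 3 and 9, so the test is more powerful against the second.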

This argument completes the rationale that supports the joint hypoth- 
esis test procedure of Section 20.3. It also supports the single hypothesis 
test procedure of Section 20.2, because that, as we have seen, is equiv- 
alent to a joint test with p = 1. 

Actually, for $p = 1$, one can deduce the noncentral chi-square
probabilities from the $N(0, 1)$ cdf. The calculation runs as follows. Suppose
$w^\circ \sim \chi^{2*}(1, \lambda^2)$. Then $w^\circ = z^{\circ 2}$, where $z^\circ \sim N(\lambda, 1)$, that is,
$(z^\circ - \lambda) \sim N(0, 1)$. Let $A = \{w^\circ > c\} = A_1 \cup A_2$, where


$$A_1 = \{z^\circ > \sqrt{c}\} = \{(z^\circ - \lambda) > (\sqrt{c} - \lambda)\},$$


$$A_2 = \{z^\circ < -\sqrt{c}\} = \{(z^\circ - \lambda) < -(\sqrt{c} + \lambda)\}.$$


Now $\Pr(A_1) = 1 - F(\sqrt{c} - \lambda) = F(\lambda - \sqrt{c})$, and $\Pr(A_2) = F[-(\lambda + \sqrt{c})]$,
where $F(\cdot)$ denotes the $N(0, 1)$ cdf. Because $A_1$ and $A_2$ are disjoint, we
have


$$\Pr(w^\circ > c) = \Pr(A) = \Pr(A_1) + \Pr(A_2) = F(\lambda - \sqrt{c}) + F[-(\lambda + \sqrt{c})].$$


For example, suppose $\lambda^2 = 1$ and $c = 3.84$. Then $\lambda = 1$ and $\sqrt{c} = 1.96$, so


$$\Pr(w^\circ > c) = F(-0.96) + F(-2.96) = 0.168 + 0.002 = 0.170,$$


as in Table A.4.
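
The $p = 1$ formula is easy to evaluate directly. The following Python sketch is not part of the original text; the function name `power_p1` is hypothetical, and it uses the rounded critical value 3.84 from the text.

```python
import math
from statistics import NormalDist

F = NormalDist().cdf   # the N(0, 1) cdf

def power_p1(lam_sq, c=3.84):
    """Pr(w > c) for w ~ noncentral chi-square(1, lam_sq), via the normal cdf."""
    lam, root_c = math.sqrt(lam_sq), math.sqrt(c)
    return F(lam - root_c) + F(-(lam + root_c))

for lam_sq in (0.0, 1.0, 4.0, 9.0):
    print(lam_sq, round(power_p1(lam_sq), 3))
```

At $\lambda^2 = 0$ the result is the significance level, at $\lambda^2 = 1$ it reproduces the 0.170 computed above, and it rises with $\lambda^2$ thereafter.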


Exercises 


20.1 The CNR model applies to $E(y) = X\beta$. You know that $\sigma^2 = 2$
and that
. {31
xx «(5 1). 


In a sample of 32 observations, the LS coefficients are $b_1 = 2$, $b_2 = 2$.


(a) Test at the 5% significance level the joint null hypothesis that
$\beta_1 = 3 = \beta_2$.


(b) State the alternative hypothesis against which you are testing. 
20.2 The CNR model applies to $E(y) = X\beta$, with $\sigma^2 = 7$ and


ry _ (4 1) 
Х'Х = е 2]: 
The null hypothesis $\beta_2 = 1$ will be tested at the 10% significance level,
against the alternative that $\beta_2 \ne 1$. What is the probability of rejecting
that null hypothesis, if the true value of $\beta_2$ is 3?


20.3 The regression slope $b$ in a CNR model is distributed $N(\beta, 1)$.
The null hypothesis $\beta = 0$ will be tested at the 10% significance level
by using the statistic $z^\circ = b/\sigma_b$. That is, the null will be rejected if and
only if $|z^\circ| > 1.645$.


(a) Write and run a program that tabulates the power of the test at
these 9 values of the true parameter $\beta$:


-2   -1.5   -1   -0.5   0   0.5   1   1.5   2


(b) Redo (a) for the situation where $b \sim N(\beta, 4)$.
(c) What do your two tables tell you about the effect of $\sigma_b^2$ on the
power of the test?


GAUSS Hint:
The command cdfn(c) evaluates the standard normal cdf at the
point c.


20.4 The pair of regression slopes $b_1$, $b_2$ in a CNR model is distributed
BVN$(\beta_1, \beta_2, 1, 1, r)$ with $r = 0.6$. The joint null hypothesis $\beta_1 = 0$,
$\beta_2 = 0$ will be tested at the 5% significance level by using the statistic


$$w^\circ = (b_1^2 + b_2^2 - 2rb_1b_2)/(1 - r^2).$$

That is, the null will be rejected iff $w^\circ > 5.99$.

(a) Write and run a program that tabulates the power of the test at
these 9 values of the true parameter pair $(\beta_1, \beta_2)$:


-1, 1     0, 1     1, 1
-1, 0     0, 0     1, 0
-1, -1    0, -1    1, -1

(b) Redo (a) for the situation where $r = -0.6$.


(c) What do your two tables tell you about the effect of the sign of
the correlation $r$ on the power of the test?


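GAUSS Hint:
GAUSS has a command that evaluates the cdf of the noncentral chi-square
distribution with degrees of freedom $n$ and a given noncentrality
parameter at the point $c$. For the tests in (a) and (b), the noncentrality
parameter is


$$\lambda^2 = (\beta_1^2 + \beta_2^2 - 2r\beta_1\beta_2)/(1 - r^2).$$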


21 CNR Model: Inference with $\sigma^2$ Unknown


21.1. Distribution Theory 


Thus far, the procedures for constructing confidence intervals and
regions for $\beta$, and for testing hypotheses about its elements, have
required knowledge of $\sigma^2$. We now extend the theory to obtain
procedures that are operational in practice, where $\sigma^2$ is unknown. It is natural
to use

$$\hat\sigma^2 = e'e/(n - k)$$

in place of $\sigma^2$, and thus to use

$$\hat z_j = (b_j - \beta_j)/\hat\sigma_{b_j}$$

in place of $z_j = (b_j - \beta_j)/\sigma_{b_j}$, and

$$\hat w = (t - \theta)'D(t - \theta)/\hat\sigma^2$$

in place of $w = (t - \theta)'D(t - \theta)/\sigma^2$. To assess the distribution of the
new statistics, we draw on some additional implications of the CNR
model.


For the CNR model in which $y \sim N(X\beta, \sigma^2I)$, the relevant theory
resumes with


D6. $w_e = e'e/\sigma^2 \sim \chi^2(n - k)$.


Proof. Recall the theory of Chapter 18 on functions of normal vectors,
and take these steps:


(i) Let $\varepsilon = y - X\beta$. So $\varepsilon \sim N(0, \sigma^2I)$ by P5 (Section 18.2).
(ii) Let $u = (1/\sigma)\varepsilon$. So the $n \times 1$ vector $u \sim N(0, I)$.
(iii) Rewrite $y = X\beta + \varepsilon$ as $y = X\beta + \sigma u$.
(iv) Then $e = My = M(X\beta + \sigma u) = \sigma Mu$, using $MX = O$.
(v) So $e'e = \sigma^2u'Mu$, and $w_e = e'e/\sigma^2 = u'Mu$.
(vi) The nonrandom matrix $M$ is idempotent with rank $n - k$.
(vii) So by Q2 (Section 18.4), $w_e \sim \chi^2(n - k)$. ∎


Next,

D7. The random vectors $b$ and $e$ are independent.


Proof. Continue the construction above. The steps:


(i) $b = Ay = A(X\beta + \sigma u) = \beta + \sigma Au$, using $AX = I$.
(ii) Let $t_1 = (1/\sigma)e = Mu$ and $t_2 = (1/\sigma)(b - \beta) = Au$.
(iii) Since $AM = O$, the conditions of Q3 (Section 18.4) are met, so
$t_1$ and $t_2$ are independent.
(iv) So $e = \sigma t_1$ and $b = \beta + \sigma t_2$ are independent. ∎


As a corollary, we have that any function of $e$ is independent of any
function of $b$. Specifically,


D8. Each of the statistics $e$, $e'e$, $w_e$, $\hat\sigma^2$, $\hat V(b)$ is independent of each
of the statistics $b$, $b_j$, $t = Hb$, $w$, $z_j$.


Now turn to the statistics that use the estimator $\hat\sigma^2$ instead of the
population parameter $\sigma^2$. Let $v = \hat w/p$. Then


D9. $v = (t - \theta)'D(t - \theta)/(p\hat\sigma^2) \sim F(p, n - k)$.


Proof. Recall from Eq. (18.2) the requirement for a random variable
to have the Snedecor F-distribution. It suffices to show that $v$ is the ratio
of two independent chi-square variables, each divided by its degrees of
freedom parameter. The steps:


(i) $\hat\sigma^2/\sigma^2 = [e'e/(n - k)]/\sigma^2 = (e'e/\sigma^2)/(n - k) = w_e/(n - k)$.
(ii) $v = \hat w/p = (w/p)/(\hat\sigma^2/\sigma^2) = (w/p)/[w_e/(n - k)]$.
(iii) But $w \sim \chi^2(p)$ is independent of $w_e \sim \chi^2(n - k)$. ∎
Continuing, let $u_j = \hat z_j$. Then


D10. $u_j = (b_j - \beta_j)/\hat\sigma_{b_j} \sim t(n - k)$.


Proof. Recall from Eq. (18.3) the requirement for a random variable
to have Student’s t-distribution. It suffices to show that $u_j$ is the ratio of
a standard normal variable to the square root of an independent
chi-square variable divided by its parameter. The steps:


(i) $(b_j - \beta_j)/\sigma_{b_j} = z_j$.
(ii) $\hat\sigma_{b_j}/\sigma_{b_j} = \sqrt{\hat V(b_j)/V(b_j)} = \sqrt{\hat\sigma^2/\sigma^2} = \sqrt{w_e/(n - k)}$.
(iii) $u_j = [(b_j - \beta_j)/\sigma_{b_j}]/[\hat\sigma_{b_j}/\sigma_{b_j}] = z_j/\sqrt{w_e/(n - k)}$.
(iv) But $z_j \sim N(0, 1)$ is independent of $w_e \sim \chi^2(n - k)$. ∎


21.2. Confidence Intervals and Regions 


Under the CNR model, to construct confidence intervals and regions
when $\sigma^2$ is unknown, one uses the $t$ and $F$ distribution results in the
same way that the $N(0, 1)$ and $\chi^2$ distribution results would be used
were $\sigma^2$ known. So the following discussion can be concise.


Confidence Intervals 


For a single regression coefficient, draw on D10. Let $c$ be the two-tail
5% critical value in the $t(n - k)$ table; that is, $G(c) = 0.975$, where $G(\cdot)$
is the cdf of the $t(n - k)$ distribution. Then $b_j \pm c\hat\sigma_{b_j}$ provides a 95%
confidence interval for $\beta_j$. The rationale: the event that the random
variable $b_j$ lies within $c\hat\sigma_{b_j}$ of the fixed parameter $\beta_j$ is


$$A = \{\beta_j - c\hat\sigma_{b_j} \le b_j \le \beta_j + c\hat\sigma_{b_j}\} = \{|(b_j - \beta_j)/\hat\sigma_{b_j}| \le c\} = \{|u_j| \le c\}.$$


Since D10 says that $u_j \sim t(n - k)$, we conclude that
$\Pr(A) = G(c) - G(-c) = 0.975 - (1 - 0.975) = 0.95$.

Alternatively, we might draw on D9, specialized to $p = 1$. Let $d$ be
the 5% critical value in the $F(1, n - k)$ table; that is, $G_1(d) = 0.95$, where
$G_1(\cdot)$ is the cdf of the $F(1, n - k)$ distribution. Then a 95% confidence
interval will consist of all values $\beta_j$ satisfying the inequality $v_j \le d$, where

$$v_j = (b_j - \beta_j)^2/\hat\sigma_{b_j}^2.$$


Let $B = \{v_j \le d\}$ be the event that the true $\beta_j$ lies in this interval. Now,
$B$ is identical to $A = \{|u_j| \le c\}$ because $v_j = |u_j|^2$ and $d = c^2$, so this is
the same interval we got from D10.

In the same manner, to construct a confidence interval for a single
linear function of coefficients $\theta = h'\beta$, draw on D9 (with $p = 1$), or its
$t(n - k)$ equivalent.


Joint Confidence Regions 


Similarly for joint confidence regions: suppose we are concerned with
the $p \times 1$ parameter vector $\theta = H\beta$. Let $d_p$ be the 5% critical value of
the $F(p, n - k)$ distribution; that is, $G_p(d_p) = 0.95$, where $G_p(\cdot)$ is the cdf
of the $F(p, n - k)$ distribution. We rely on D9 to propose


$$(\theta - t)'D(\theta - t)/(p\hat\sigma^2) \le d_p$$


as the 95% confidence region for the unknown parameter vector $\theta$. The
region consists of all $p \times 1$ vectors $\theta$ that satisfy the inequality. Observe
that here $\theta$ denotes the argument of a function, not necessarily the true
parameter vector. The rationale for the proposal is that the true
parameter vector $\theta$ lies in that random region iff $v \le d_p$, an event whose
probability is 0.95.

It is instructive to rewrite this operational confidence region $v \le d_p$ as


$$\hat w = (\theta - t)'D(\theta - t)/\hat\sigma^2 \le pd_p.$$


Observe its similarity to the ellipsoidal region $w \le c_p$ that would be used
were $\sigma^2$ known, namely


$$w = (\theta - t)'D(\theta - t)/\sigma^2 \le c_p.$$


The region $v \le d_p$ is also an ellipsoid centered at $t$, and indeed has
precisely the same shape as the region $w \le c_p$, being merely expanded.
So our previous analysis (Section 19.6) of the shape of the region, and
of its relation to the rectangular region obtained by intersecting
single-parameter confidence intervals, carries over directly.
21.3. Hypothesis Tests


The theory just developed for constructing confidence intervals and
regions for $\beta$ adapts to testing hypotheses about $\beta$ when $\sigma^2$ is unknown.


Single Parameter 


Suppose that we have a null hypothesis about the $j$th regression
coefficient, namely $\beta_j = \beta_j^\circ$, where $\beta_j^\circ$ is a specific number. At the 5%
significance level, accept the null iff $\beta_j^\circ$ lies in the 95% confidence interval for
$\beta_j$. Equivalently, calculate the test statistic


$$u_j^\circ = (b_j - \beta_j^\circ)/\hat\sigma_{b_j},$$


and compare it with $c$, the two-tail 5% critical value in the $t(n - k)$ table.
If $|u_j^\circ| > c$, then reject the null hypothesis $\beta_j = \beta_j^\circ$. If $|u_j^\circ| \le c$, then
accept the null hypothesis $\beta_j = \beta_j^\circ$.

In presenting the results of an empirical study, a correct practice is
to report the regression coefficients $b_j$ along with their standard errors
$\hat\sigma_{b_j}$. This gives readers the information they need to construct a
confidence interval for each regression coefficient, and to test a hypothesis
about any one of them. It is common practice to report the regression
coefficients along with their "t-ratios" or "t-statistics," the $u_j^\circ = b_j/\hat\sigma_{b_j}$, and
to say, if $u_j^\circ$ is large, that $b_j$ is "significant," meaning “significantly
different from zero." This common practice is not a good one, because it
encourages readers to consider only “zero null hypotheses" $\beta_j = 0$, which
are not necessarily the interesting ones. Of course, a knowledgeable
reader can always unscramble $b_j$ and $u_j^\circ$ to recover $\hat\sigma_{b_j}$.
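
The unscrambling is simple arithmetic, sketched here in Python for illustration; this is not part of the original text, and the function names and the published numbers are hypothetical.

```python
def se_from_t_ratio(b_j, t_ratio):
    """Recover the standard error from a coefficient and its reported t-ratio b_j / se."""
    return b_j / t_ratio

def t_stat(b_j, beta_null, se):
    """Test statistic for the null beta_j = beta_null."""
    return (b_j - beta_null) / se

# Hypothetical published results: coefficient 0.8 with a "t-ratio" of 4.0.
b_j, reported_t = 0.8, 4.0
se = se_from_t_ratio(b_j, reported_t)   # standard error implied by the report
u_zero = t_stat(b_j, 0.0, se)           # reproduces the reported ratio
u_one = t_stat(b_j, 1.0, se)            # statistic for the null beta_j = 1
print(se, u_zero, u_one)
```

Here the estimate is "significant" in the usual sense, yet the statistic for the null $\beta_j = 1$ is about $-1$ in this made-up case, which no two-tail 5% critical value from a $t$ table would reject, since those values are never below 1.96.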


Set of Parameters 


Suppose we have a joint hypothesis, one that concerns several
parameters, specifically $p$ linear functions of $\beta$. Let $\theta = H\beta$, where the $p \times k$
nonrandom matrix $H$ has rank $p$. Our null hypothesis is $\theta = \theta^\circ$, where
$\theta^\circ$ is a numerical $p \times 1$ vector. By appropriate choice of $H$, this setup
subsumes the situation where the hypothesis is about the full vector $\beta$,
or about a $k_2 \times 1$ subvector $\beta_2$, or about a single element $\beta_j$.

To test at the 5% significance level, against the alternative $\theta \ne \theta^\circ$,
accept the null if $\theta^\circ$ lies in the 95% confidence region for $\theta$, reject
otherwise. Equivalently, calculate the test statistic

$$v^\circ = (t - \theta^\circ)'D(t - \theta^\circ)/(p\hat\sigma^2),$$


and compare it with $d_p$, the 5% critical value from the $F(p, n - k)$ table.
If $v^\circ > d_p$, then reject the null hypothesis $\theta = \theta^\circ$. If $v^\circ \le d_p$, then accept
the null hypothesis $\theta = \theta^\circ$.

The power function for the F-test is very similar to that for the
chi-square test discussed in Sections 20.4 and 20.5, so we need not discuss
it explicitly.


21.4. Zero Null Subvector Hypothesis 


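A leading special case of a joint hypothesis arises when the null says
that several of the $\beta_j$ are zero. We refer to this as a zero null subvector
hypothesis. Without loss of generality, let those $\beta_j$'s be the last $k_2$
elements of $\beta$. Partitioning as


$$E(y) = X\beta = (X_1, X_2)\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = X_1\beta_1 + X_2\beta_2,$$


we state the null as $\beta_2 = 0$. To fit it into the framework of Section 21.3,
set $p = k_2$, $H = (O, I)$ where the $O$ is $k_2 \times k_1$, the $I$ is $k_2 \times k_2$, and $\theta^\circ = 0$.
Then


$$\theta = H\beta = \beta_2, \qquad t = Hb = b_2,$$

$$HQ^{-1}H' = (O, I)\begin{pmatrix} Q^{11} & Q^{12} \\ Q^{21} & Q^{22} \end{pmatrix}\begin{pmatrix} O \\ I \end{pmatrix} = Q^{22},$$

$$D = (HQ^{-1}H')^{-1} = (Q^{22})^{-1} = Q_{22}^* = X_2^{*\prime}X_2^*,$$

with $X_2^* = M_1X_2$. So the test statistic $v^\circ$ becomes


$$v^\circ = b_2'Q_{22}^*b_2/(k_2\hat\sigma^2).$$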


Using Residual Sums of Squares 


There is another way to calculate this test statistic. Recall from Eq.
(17.6) that


$$b_2'Q_{22}^*b_2 = b_2'X_2^{*\prime}X_2^*b_2 = e^{*\prime}e^* - e'e,$$


where $e^{*\prime}e^*$ is the sum of squared residuals from the short regression
of $y$ on $X_1$ alone, while $e'e$ is the sum of squared residuals from the
long regression of $y$ on $X = (X_1, X_2)$. Recall also that $\hat\sigma^2 = e'e/(n - k)$.
So the test statistic can be written as


$$v^\circ = \frac{(n - k)}{k_2}\,\frac{(e^{*\prime}e^* - e'e)}{e'e}.$$


This version may be computationally convenient. To get the test
statistic for a zero null hypothesis $\beta_2 = 0$, there is no need to extract
the $k_2 \times k_2$ submatrix $\hat V(b_2) = \hat\sigma^2Q^{22}$ from $\hat V(b)$, and invert it. Instead,
just run the appropriate long and short regressions and use their sums
of squared residuals.

The result is also analytically instructive. Large values of $v^\circ$ lead to
rejection of the null, and $v^\circ$ will be large (ceteris paribus) when the ratio
$(e^{*\prime}e^* - e'e)/e'e$ is large. That is to say, the null hypothesis $\beta_2 = 0$ is
rejected when dropping $X_2$ from the regression leads to a large
proportional increase in the sum of squared residuals, that is, to a
substantial worsening of the fit. This is quite natural. In terms of the underlying
population, suppose that the model has


$$E(y \mid x_2, \ldots, x_8) = \beta_1 + \beta_2x_2 + \cdots + \beta_8x_8,$$

while the null hypothesis is $\beta_6 = \beta_7 = \beta_8 = 0$, that is,

$$E(y \mid x_2, \ldots, x_8) = \beta_1 + \beta_2x_2 + \cdots + \beta_5x_5.$$


This null says that the conditional expectation of the dependent variable
$y$ given all the $x$'s does not in fact vary with $x_6$, $x_7$, $x_8$. So the null imposes
a restriction on the CEF. To test the null, impose its restriction in
estimating the CEF, by running the short regression, and see whether
the fit is much worse than the fit obtained when the restriction was not
imposed. If the fit is not much worse, then accept the hypothesis that
the conditional expectation of $y$ given all the $x$'s does not in fact vary
with $x_6$, $x_7$, $x_8$.


Using $R^2$


Continuing, suppose that $X_1$ contains the summer vector (and perhaps
more columns). Then both the short and long regressions contain an
intercept, and so their $R^2$'s are well defined. For the long regression we
have


$$e'e = (1 - R^2)\sum_i (y_i - \bar y)^2,$$

and for the short regression we have

$$e^{*\prime}e^* = (1 - R^{*2})\sum_i (y_i - \bar y)^2.$$

So

$$(e^{*\prime}e^* - e'e)/e'e = (R^2 - R^{*2})/(1 - R^2) = \Delta R^2/(1 - R^2),$$


where $\Delta R^2$ is the reduction in $R^2$ that occurs when $X_2$ is dropped from
the list of explanatory variables. For this standard situation, then, the
test statistic can be written as


$$v^\circ = \frac{(n - k)}{k_2}\,\frac{\Delta R^2}{(1 - R^2)}.$$


We now see that $\beta_2 = 0$ is rejected when dropping $X_2$ from the
regression leads to a large decrease in $R^2$, relative to $1 - R^2$ (especially when
$n - k$ is large and/or $k_2$ is small). This version may be computationally
convenient: one needs to record only the $R^2$'s of the short and long
regressions.
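
The two computational forms of the statistic can be compared in a few lines of Python. This sketch is not part of the original text; the function names and all of the sample figures are hypothetical, chosen so that the residual sums of squares and the $R^2$'s describe the same pair of regressions.

```python
def f_from_rss(rss_short, rss_long, n, k, k2):
    """Test statistic from the sums of squared residuals of the short and long regressions."""
    return ((n - k) / k2) * (rss_short - rss_long) / rss_long

def f_from_r2(r2_short, r2_long, n, k, k2):
    """Same statistic from the R-squareds; valid when both regressions contain an intercept."""
    return ((n - k) / k2) * (r2_long - r2_short) / (1 - r2_long)

# Hypothetical sample: n = 50, k = 8 regressors in the long regression, k2 = 3 dropped.
total_ss = 200.0                      # sum of squared deviations of y about its mean
rss_long, rss_short = 100.0, 120.0    # so R^2 = 0.5 (long) and R*^2 = 0.4 (short)
v_rss = f_from_rss(rss_short, rss_long, n=50, k=8, k2=3)
v_r2 = f_from_r2(1 - rss_short / total_ss, 1 - rss_long / total_ss, n=50, k=8, k2=3)
print(v_rss, v_r2)
```

Both routes give the same number, to be compared with the 5% critical value from the $F(3, 42)$ table in this made-up case.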


All Slopes Zero 


Finally suppose that $X_1$ contains only the summer vector, so $\beta_1$ contains
only the intercept. Then the null $\beta_2 = 0$ asserts that all of the slopes
$\beta_2, \ldots, \beta_k$ are zero. Here the short-regression sum of squared residuals
is $e^{*\prime}e^* = \sum_i (y_i - \bar y)^2$, so the short-regression $R^{*2}$ is in effect zero. So
$\Delta R^2 = R^2$, and the test statistic simplifies to


$$v^\circ = \frac{(n - k)}{(k - 1)}\,\frac{R^2}{(1 - R^2)}.$$


Evidently, an “all slopes zero" null hypothesis will be rejected when $R^2$
is large, especially when $n$ is large and/or $k$ is small.

The all-slopes-zero test is sometimes referred to as the "test of
significance of the complete regression" or the "overall significance test of
the regression." Many packaged computer programs routinely calculate
this statistic and report it as the "regression F-statistic." As a result, in
many journal articles and textbooks it is routinely reported along with
$R^2$. Typically the regression F-statistic is large, and one sees dramatic
statements. To take a textbook example, Intriligator (1978, pp. 138-
141), with $n = 12$ annual observations, regresses GNP on $k = 3$
explanatory variables (the constant, lagged GNP, and government
expenditures). The $R^2$ is 0.9958, so $v^\circ = 1072$, while the 5% critical value from
the $F(2, 9)$ table is $d = 4.26$. He writes “it is clear that the overall
regression is highly significant and that the hypothesis that all [slope]
coefficients are zero is overwhelmingly rejected" (p. 140). A less
dramatic description of the situation would run as follows. The economic
model, which allows $\mu_t = E(Y_t)$ to vary linearly with $Y_{t-1}$ and $G_t$, fits
much better than a model that insists that $\mu_t = \mu$ for all $t$. In short, the
economic model is “much better than nothing."
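
The regression F-statistic in this example can be recomputed from the reported figures. The Python sketch below is an illustration, not part of the original text, and the function name is hypothetical. Using the rounded $R^2 = 0.9958$ gives a value near 1067 rather than the 1072 quoted above; the gap is presumably due to rounding in the reported $R^2$, which is an assumption on our part and not something the source states.

```python
def regression_f(r2, n, k):
    """All-slopes-zero statistic: ((n - k) / (k - 1)) * R^2 / (1 - R^2)."""
    return ((n - k) / (k - 1)) * r2 / (1 - r2)

v = regression_f(0.9958, n=12, k=3)
critical = 4.26   # 5% critical value of F(2, 9), as quoted in the text
print(round(v, 1), v > critical)
```

Because $1 - R^2$ sits in the denominator, small changes in an $R^2$ this close to one move the statistic considerably, without changing the conclusion of the test.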


Exercises 


21.1 Consider the special case of the CNR model in which the only 
explanatory variable is the constant. Use the present distribution results 
to derive F1—F4 of Section 8.6, the theorems on the distribution of the 
sample mean and variance in random sampling from a univariate 
normal distribution. 


21.2 The CNR model applies to $E(y) = X\beta$, with $\sigma^2 = 1$ and


EL imate 1 
p= (7), xx-( a) 
For each of the following two samples, determine the best guess of the 


random variable b, justifying your answer. 


(a) A sample has e’e = 10. 
(b) A sample has 5, = 3. 


21.3 Here are the results of two regressions run on annual time series
for the years 1935-1978:


(i) $\hat y = 50 + 0.2x_2 + 0.5x_3 - 2.0x_4$, $R^2 = 0.80$.
(ii) $\hat y = 100 + 0.3x_2 + 0.4x_3$, $R^2 = 0.76$.


Determine whether the following is true or false: in equation (i), the
standard error for $b_4$ is $1/\sqrt{2}$.


21.4 The CNR model applies to $E(y) = x_1\beta_1 + x_2\beta_2$. A sample of size
n = 102 gives these statistics: 
s»). -(2 1), cease


Let $\theta = \beta_1 - \beta_2$. Test, at the 5% significance level, the null hypothesis
that $\theta =$


21.5 The CNR model applies to $E(y) = X\beta$ with


^. {4 2 
xx (1 2). 
A sample of size 32 gives $b_1 = 2$, $b_2 = 5$, $e'e = 60$. Construct a 90%
confidence interval for $\beta_2 - \beta_1$.


21.6 Suppose that the CNR model applies to the earnings function

$$E(y \mid x_2, \ldots, x_8) = \beta_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_5 + \beta_6x_6 + \beta_7x_7 + \beta_8x_8,$$

estimated in Exercise 17.4.


(a) Calculate $b$, $e'e$, $\hat\sigma^2$, and $\hat V(b)$.

(b) Report a 95% confidence interval for the “effect of education,”
namely $\beta_2$.

(c) Report a 95% joint confidence region for the “effects of
experience," $\beta_3$ and $\beta_4$.

(d) Let $\theta = \beta_3 + 2\beta_4\bar x_3$, where $\bar x_3$ is the sample mean experience
(treated as nonrandom). Give an interpretation of, and a 90%
confidence interval for, $\theta$.

(e) Test, at the 5% significance level, the null hypothesis that "race
does not affect earnings,” that is, $\beta_5 = 0$.

(f) Test, at the 5% significance level, the joint null hypothesis that
"region does not affect earnings," that is, $\beta_6 = \beta_7 = \beta_8 = 0$.


GAUSS:Hint 
TES is amaii, 
EAS a rad of integers and cis a vector of integers,
22  Issues in Hypothesis Testing


22.1. Introduction 


In this chapter, we take up a variety of practical and procedural topics 
associated with hypothesis testing. Among the topics are: the conversion 
of general hypotheses into the zero-null-subvector form, the choice of 
significance level, testing against one-sided alternatives, the abuse of 
tests, and inference when the normality assumption is not adopted. 


22.2. General Linear Hypothesis 


In general, a linear hypothesis takes the form $\theta = \theta^{\circ}$, where $\theta = H\beta$,
$H$ is $p \times k$ nonrandom with rank $p$, and $\theta^{\circ}$ is numerical. In Section 21.4,
we focused on the special case $\beta_2 = 0$ (with $\beta_2$ a subvector of $\beta$), but
other cases arise in practice. For example, suppose that we have the
demand function

$$E(Y \mid X_2, X_3, X_4) = \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4,$$

where $Y$ = log quantity of butter, $X_1 = 1$, $X_2$ = log real income, $X_3$ =
log butter price, $X_4$ = log margarine price. We entertain the hypothesis
that only the ratio of the two prices, not their separate levels, matters
to the consumer. This says that the two log-price slopes are equal in
magnitude but opposite in sign, that is, $\beta_4 = -\beta_3$. This hypothesis is
expressed as $\theta = \beta_3 + \beta_4$, $\theta^{\circ} = 0$.

For another example, consider the Cobb-Douglas production function,

$$E(Y \mid K, L, N) = \beta_1 + \beta_2 K + \beta_3 L + \beta_4 N,$$

with $Y$ = log output, $K$ = log capital, $L$ = log land, $N$ = log labor. The
hypothesis of constant returns to scale says that the sum of the log-input
slopes (the elasticities) is unity. It takes the form $\theta = \beta_2 + \beta_3 + \beta_4$ with
$\theta^{\circ} = 1$.

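For a third example, consider a macroeconomic consumption function,
with $E(Y \mid X) = \alpha_1 + \gamma_1 X$ in wartime, $E(Y \mid X) = \alpha_2 + \gamma_2 X$ in
peacetime. Here $Y$ = consumption, and $X$ = income. Defining the
dummy variable $Z$, which equals 1 if war and equals 0 if peace, permits
us to write the two functions together as

$$E(Y \mid X, Z) = \alpha_1 Z + \gamma_1 ZX + \alpha_2(1 - Z) + \gamma_2(1 - Z)X = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4,$$

say. The null hypothesis that the function is the same in war and peace
is a joint linear hypothesis: $\beta_1 = \beta_3$ and $\beta_2 = \beta_4$. It is expressible as
$\theta = \theta^{\circ}$, with

$$\theta = \begin{pmatrix} \beta_1 - \beta_3 \\ \beta_2 - \beta_4 \end{pmatrix}, \qquad \theta^{\circ} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$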


For the general linear hypothesis in the CNR model, we saw (Section
21.3) that the F-test statistic is

$$v^{\circ} = (t - \theta^{\circ})' D (t - \theta^{\circ}) / (p\hat\sigma^2),$$

where $t = Hb$ and $D = (HQ^{-1}H')^{-1}$. For the zero-null-subvector special
case, we saw (Section 21.4) that the numerator of this test statistic could
be written as

$$(t - \theta^{\circ})' D (t - \theta^{\circ}) = b_2' Q^*_{22} b_2 = e^{*\prime}e^* - e'e,$$

where $e^{*\prime}e^*$ is the sum of squared residuals from the short regression
of $y$ on $X_1$ alone, and $e'e$ is the sum of squared residuals from the long
regression of $y$ on $X = (X_1, X_2)$.

In fact any linear hypothesis can be converted into a zero-null-sub- 
vector hypothesis, so that the computational convenience of short and 
long regressions is available in general. 

Start with $\theta = H\beta$, where the matrix $H$ is $p \times k$ with rank $p$. We may
suppose without loss of generality that the last $p$ columns of $H$ are
linearly independent: that is, partitioning $H = (H_1, H_2)$, the $p \times p$
submatrix $H_2$ is nonsingular. Partition $X$ and $\beta$ conformably as $X =
(X_1, X_2)$ and $\beta = (\beta_1', \beta_2')'$. Define the $k \times k$ matrix

$$T = \begin{pmatrix} I & O \\ -H_2^{-1}H_1 & I \end{pmatrix},$$

where the partitioning is into $k - p$ and $p$ rows and columns. Its inverse
is

$$T^{-1} = \begin{pmatrix} I & O \\ H_2^{-1}H_1 & I \end{pmatrix}.$$

Let

$$Z = XT = (X_1, X_2)T = (X_1 - X_2 H_2^{-1} H_1,\; X_2) = (Z_1, Z_2),$$

which is an $n \times k$ rank-$k$ nonrandom matrix, interpretable as the matrix
of observations on transformed explanatory variables. Also let

$$\alpha = T^{-1}\beta = \begin{pmatrix} \beta_1 \\ H_2^{-1}H_1\beta_1 + \beta_2 \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix},$$

which is a $k \times 1$ vector of transformed parameters.
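Now

$$X\beta = X(TT^{-1})\beta = (XT)(T^{-1}\beta) = Z\alpha = Z_1\alpha_1 + Z_2\alpha_2,$$

so $E(y) = X\beta$ is equivalent to

(22.1)  $E(y) = Z_1\alpha_1 + Z_2\alpha_2.$

Further,

$$\theta = H\beta = H(T\alpha) = (H_1, H_2)T\alpha = H_2\alpha_2,$$

so $\theta = \theta^{\circ}$ is equivalent to $H_2\alpha_2 = \theta^{\circ}$, that is, to $\alpha_2 = H_2^{-1}\theta^{\circ} = \alpha_2^{\circ}$, say.
And $\alpha_2 = \alpha_2^{\circ}$ is equivalent to saying that Eq. (22.1) can be written as

(22.2)  $E(y) = Z_1\alpha_1 + Z_2\alpha_2^{\circ}.$

Let $y^{\circ} = y - Z_2\alpha_2^{\circ}$, which may be interpreted as a vector of observations
on a transformed dependent variable. Then Eq. (22.1) is equivalent to

$$E(y^{\circ}) = Z_1\alpha_1 + Z_2\alpha_2^*,$$

where $\alpha_2^* = \alpha_2 - \alpha_2^{\circ}$, while Eq. (22.2) is equivalent to

$$E(y^{\circ}) = Z_1\alpha_1.$$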


We have translated a general linear hypothesis into a zero-null-subvector
hypothesis. Because $Z = XT$ with $T$ nonsingular, regressing $y$ on
$Z$, or $y^{\circ}$ on $Z$, gives the same sum of squared residuals as regressing $y$
on $X$: see Exercise 17.2. So the kernel of the test statistic for $H\beta = \theta^{\circ}$
can be calculated as $e^{*\prime}e^* - e'e$, where $e'e$ is obtained from the long
regression of $y$ on $Z$ (or equivalently of $y^{\circ}$ on $Z$, or of $y$ on $X$), and
$e^{*\prime}e^*$ is obtained from the short regression of $y^{\circ}$ on $Z_1$.
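As a numerical check on this argument, here is a minimal sketch in Python with NumPy. The simulated data, the particular H and θ° values, and the helper name `ssr` are illustrative assumptions, not part of the text. It computes the kernel directly as (t − θ°)'D(t − θ°) and again as the difference between the short-regression and long-regression sums of squared residuals.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared residuals from the least-squares regression of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

# Simulated data (illustrative only): n observations, k regressors, p restrictions.
rng = np.random.default_rng(0)
n, k, p = 50, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 0.5, -0.3, 0.3]) + rng.normal(size=n)

# Hypothesis H beta = theta0; the last p columns of H are linearly independent.
H = np.array([[0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
theta0 = np.array([0.2, 0.0])

# Direct calculation of the kernel (t - theta0)' D (t - theta0).
Q = X.T @ X
b = np.linalg.solve(Q, X.T @ y)
t = H @ b
D = np.linalg.inv(H @ np.linalg.inv(Q) @ H.T)
kernel_direct = float((t - theta0) @ D @ (t - theta0))

# The same kernel from the short and long regressions after the transformation.
H1, H2 = H[:, :k - p], H[:, k - p:]
X1, X2 = X[:, :k - p], X[:, k - p:]
Z1 = X1 - X2 @ np.linalg.solve(H2, H1)   # Z1 = X1 - X2 H2^{-1} H1
alpha2_0 = np.linalg.solve(H2, theta0)   # alpha2 under the null: H2^{-1} theta0
y0 = y - X2 @ alpha2_0                   # transformed dependent variable (Z2 = X2)
kernel_regressions = ssr(y0, Z1) - ssr(y, X)

print(kernel_direct, kernel_regressions)
```

The two printed numbers should agree up to rounding error, which is the computational convenience claimed above.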

All this is easier done than said. For example, for the butter demand
equation, the restriction $\beta_4 = -\beta_3$ says that

$$E(y) = x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + x_4\beta_4 = x_1\beta_1 + x_2\beta_2 + (x_3 - x_4)\beta_3 = z_1\beta_1 + z_2\beta_2 + z_3\beta_3,$$

with $z_1 = x_1$, $z_2 = x_2$, $z_3 = x_3 - x_4$. The restricted regression is
implemented by running $y$ on $Z = (z_1, z_2, z_3)$.

For the Cobb-Douglas production model, the constant-returns-to-scale
restriction may be implemented by regressing $y = Y - N$ =
log(output/labor) on $z_1 = 1$, $z_2 = K - N$ = log(capital/labor), and $z_3 =
L - N$ = log(land/labor).

For the war and peace consumption functions, the equal-slope restriction
$\beta_2 = \beta_4$ says that

$$E(y) = x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + x_4\beta_4 = x_1\beta_1 + (x_2 + x_4)\beta_2 + x_3\beta_3 = z_1\beta_1 + z_2\beta_2 + z_3\beta_3,$$

with $z_1 = x_1$, $z_2 = x_2 + x_4$, $z_3 = x_3$. This restricted regression is
implemented by running $y$ on $Z = (z_1, z_2, z_3)$. If the equal-intercept
restriction $\beta_1 = \beta_3$ is also imposed, write

$$E(y) = w_1\beta_1 + w_2\beta_2,$$

with $w_1 = z_1 + z_3 = x_1 + x_3$, $w_2 = z_2 = x_2 + x_4$. Then run $y$ on $W =
(w_1, w_2)$ to impose the second restriction as well.

The approach we have been using amounts to solving out the hypothesized
coefficient restrictions to get a shorter regression that can be
fitted by unrestricted least squares. Since the approach is feasible for
any general linear hypothesis, in practice there is no need to calculate
$(t - \theta^{\circ})' D (t - \theta^{\circ})$ directly to get the test statistic $v^{\circ}$. Just use the residual
sums of squares from the long, and an appropriate short, regression.
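The recipe can be sketched for the Cobb-Douglas case in Python with NumPy. The data are simulated and the coefficient values, seed, and helper name `ssr` are assumptions made only for illustration. The statistic is the difference in sums of squared residuals divided by pσ̂², with p = 1 restriction and σ̂² taken from the long regression.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared residuals from the least-squares regression of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

# Simulated logs of capital, land, labor, and output (illustrative values only).
rng = np.random.default_rng(1)
n = 104
K, L, N = rng.normal(size=(3, n))
Y = 1.0 + 0.3 * K + 0.2 * L + 0.5 * N + 0.1 * rng.normal(size=n)

ones = np.ones(n)
# Long regression: Y on (1, K, L, N), no restriction imposed.
long_ssr = ssr(Y, np.column_stack([ones, K, L, N]))
# Short regression: constant returns solved out, Y - N on (1, K - N, L - N).
short_ssr = ssr(Y - N, np.column_stack([ones, K - N, L - N]))

p, k = 1, 4                               # one restriction, four coefficients
sigma2_hat = long_ssr / (n - k)           # unbiased variance estimate, long regression
v = (short_ssr - long_ssr) / (p * sigma2_hat)
print(round(v, 3))
```

The value of `v` would then be compared with a critical value from the F(1, n − 4) table.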
Here is a final example to show how the solving-out-the-restrictions
approach works. Suppose that you have a pair of data sets to which

$$E(y_1) = X_1\beta_1, \qquad E(y_2) = X_2\beta_2$$

apply, where $y_1$ is $n_1 \times 1$, $X_1$ is $n_1 \times k$, $y_2$ is $n_2 \times 1$, and $X_2$ is $n_2 \times k$.
You want to test the null hypothesis $\beta_1 = \beta_2$. Assemble the data together
as

$$E(y) = E\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} X_1\beta_1 \\ X_2\beta_2 \end{pmatrix} = \begin{pmatrix} X_1 & O \\ O & X_2 \end{pmatrix}\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = X\beta,$$

say. If the null $\beta_1 = \beta_2$ ($= \beta^{\circ}$, say) is true, then

$$E(y) = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}\beta^{\circ} = X^{\circ}\beta^{\circ},$$

say. The relevant sums of squared residuals are obtainable from a long
regression ($y$ on the $2k$ columns of $X$) and a short regression ($y$ on the
$k$ columns of $X^{\circ}$). Provided that the CNR model applies to each sample,
with the same $\sigma^2$, while the two samples are independent, the difference
between those sums of squared residuals is the kernel of the appropriate
F-statistic. This special case of a standard F-test is sometimes referred
to as a "test for structural change," or as a "Chow test." Incidentally,
the long-regression sum of squared residuals can be calculated by
adding together the sums of squared residuals obtained in separate LS
regressions of $y_1$ on $X_1$, and $y_2$ on $X_2$.
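A minimal sketch of this structural-change test in Python with NumPy follows. The two simulated samples, their coefficient vectors, and the names `ssr` and `chow_v` are hypothetical choices for illustration. The block-diagonal X gives the long regression, the stacked X gives the short regression, and the last lines check the remark that the long-regression sum of squared residuals equals the sum from the two separate regressions.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared residuals from the least-squares regression of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

# Two simulated samples with the same error variance (illustrative only).
rng = np.random.default_rng(2)
n1, n2, k = 30, 40, 3
X1 = np.column_stack([np.ones(n1), rng.normal(size=(n1, k - 1))])
X2 = np.column_stack([np.ones(n2), rng.normal(size=(n2, k - 1))])
y1 = X1 @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n1)
y2 = X2 @ np.array([1.5, 2.0, -0.5]) + rng.normal(size=n2)

y = np.concatenate([y1, y2])
X_long = np.block([[X1, np.zeros((n1, k))],
                   [np.zeros((n2, k)), X2]])   # block-diagonal, 2k columns
X_short = np.vstack([X1, X2])                  # stacked, k columns

long_ssr = ssr(y, X_long)
short_ssr = ssr(y, X_short)
sigma2_hat = long_ssr / (n1 + n2 - 2 * k)
chow_v = (short_ssr - long_ssr) / (k * sigma2_hat)   # k restrictions under the null

# The long-regression SSR equals the sum of the two separate-regression SSRs.
separate_ssr = ssr(y1, X1) + ssr(y2, X2)
print(round(chow_v, 3), round(long_ssr, 6), round(separate_ssr, 6))
```

Here `chow_v` would be referred to the F(k, n₁ + n₂ − 2k) distribution.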


22.3. One-Sided Alternatives 


To test the single-parameter null hypothesis $\beta_j = \beta_j^{\circ}$ against the
alternative that $\beta_j \neq \beta_j^{\circ}$, we have learned to use the t-statistic given by
$u_j^{\circ} = (b_j - \beta_j^{\circ})/\hat\sigma_{b_j}$, rejecting the null iff $|u_j^{\circ}| > c$, where $G(c) = 0.975$
with $G(\cdot)$ being the cdf of the $t(n - k)$ distribution.

Now suppose, as occurs in some economic contexts, that the known
alternative to $\beta_j = \beta_j^{\circ}$ is one-sided, say $\beta_j > \beta_j^{\circ}$. A one-tailed version of the
t-test can be used: reject the null $\beta_j = \beta_j^{\circ}$ iff $u_j^{\circ} > c^*$, where $G(c^*) = 0.95$.
This variant is sensible. Heuristically, it would be foolish to reject $\beta_j =
\beta_j^{\circ}$ in favor of $\beta_j > \beta_j^{\circ}$ when the sample has $b_j < \beta_j^{\circ}$. More formally, the
one-tailed test has more power than the two-tailed test, for all $\beta_j > \beta_j^{\circ}$,
which is the only region where power is wanted in the present situation:
see Exercise 22.4.
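The power comparison can be sketched with the Python standard library. As a simplifying assumption, the standard error is treated as known, so the statistic is normal rather than t distributed. The function names, and `delta` for the true parameter measured in standard-error units, are hypothetical labels.

```python
from statistics import NormalDist

std_normal = NormalDist()  # N(0, 1)

def power_two_tailed(delta, level=0.05):
    """P(|z| > c) when z ~ N(delta, 1), with c the two-tailed critical value."""
    c = std_normal.inv_cdf(1 - level / 2)
    return 1 - std_normal.cdf(c - delta) + std_normal.cdf(-c - delta)

def power_one_tailed(delta, level=0.05):
    """P(z > c*) when z ~ N(delta, 1), with c* the one-tailed critical value."""
    c_star = std_normal.inv_cdf(1 - level)
    return 1 - std_normal.cdf(c_star - delta)

# delta = (true parameter minus null value) / standard error
for delta in (0.5, 1.0, 2.0):
    print(delta,
          round(power_two_tailed(delta), 3),
          round(power_one_tailed(delta), 3))
```

At every positive `delta` the one-tailed column should exceed the two-tailed column, while both tests reject with probability 0.05 when `delta` is zero.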

We have learned, as an equivalent to the t-test, the F-test that uses
the statistic $v_j^{\circ} = (b_j - \beta_j^{\circ})^2/\hat\sigma^2_{b_j}$, rejecting the null if $v_j^{\circ} > d$ where $G_1(d) =
0.95$, with $G_1(\cdot)$ being the cdf of the $F(1, n - k)$ distribution. The two
approaches are equivalent because $v_j^{\circ} = (u_j^{\circ})^2$ and $d = c^2$. But the
F-statistic $v_j^{\circ} = (u_j^{\circ})^2$ disregards the sign of $b_j - \beta_j^{\circ}$, so it is not attractive
for use when the alternative is one-sided.

For a joint hypothesis with one-sided alternatives, no t-test is available.
The F-statistic

$$v^{\circ} = (t - \theta^{\circ})' D (t - \theta^{\circ})/(p\hat\sigma^2)$$

treats positive and negative misses symmetrically, so it is not attractive
for tests against one-sided alternatives. For a discussion of appropriate
procedures, see Gouriéroux et al. (1982) and Wolak (1987).


22.4. Choice of Significance Level 


Suppose that you are asked to test the null hypothesis $\beta_j = 0$ against
the alternative $\beta_j \neq 0$, in a sample with $n - k = 120$. You obtain the
test statistic $u_j^{\circ} = 1.82$. Critical values from the $N(0, 1)$ table are $c = 1.96$
at the 5% level and $c = 1.64$ at the 10% level. With $1.64 < 1.82 < 1.96$,
the null would be accepted at the 5% level, but rejected at the 10%
level. The same piece of evidence that will accept $\beta_j = 0$ at the 5% level
will reject it at the 10% level. The interval between 1.64 and 1.96 is a
"zone of opportunity." Indeed, whatever numerical value the sample
delivers, a diligent researcher can force acceptance by setting the
significance level low enough (e.g., 1% or 0.5%) or can force rejection by
setting the significance level high enough (e.g., 10% or 20%).

How should a researcher choose the significance level? Econometrics 
texts offer little, if any, guidance. In statistics texts, the discussion focuses 
on the power of the test—the probability of rejecting the null hypothesis 
as a function of the true parameter value. 

Generally, power declines as the significance level declines: see
Exercise 22.4. Moving from the 5% to the 1% significance level not only
reduces the probability of rejecting a true null, but also reduces the
probability of rejecting a false null. The first reduction is desirable, the
second is undesirable.

There is a trade-off. To resolve the trade-off, statistics texts recommend
a cost-benefit calculation: if the net cost of accepting a false null
is less than the net cost of rejecting a true null, then choose a low
significance level. Although this cost-benefit approach should be
congenial to economists, the 5% level is almost always used in the empirical
economics literature. It is hardly plausible that distinct cost-benefit
calculations underlie that ubiquitous level. Occasionally, the 10% and 1%
levels are used. Reading closely, you may well be able to spot the
occasions on which those levels replace 5%. If an author really wants to
accept the null, she may switch to the 1% level; if an author really wants
to reject the null, he may switch to the 10% level. When such switches
do not suffice, you may see such language as "barely significant at the
1% level" (a hint that the author really wants to accept) or "almost
significant at the 10% level" (a hint that the author really wants to
reject).

This state of affairs may seem very unsatisfactory, but the textbook 
recommendation of a cost-benefit calculation is not appealing either. 
For academic research reports, neither the costs nor the benefits of the 
test decision are clear. It is rare for an economic agent to undertake 
real-world action upon reading a test outcome reported in a journal 
article. At most what may happen is that readers' beliefs shift in the 
light of the evidence. So, in almost all applied economic contexts, the 
significance level is necessarily a matter of convention rather than of 
calculation. 

It follows that readers should not take an author's announcement of
significance or nonsignificance as authoritative. Regardless of the 
author's choice of significance level and announcement of a decision, 
sensible readers will have to decide for themselves whether the evidence 
is weighty or fragile. Regardless of how the author phrases the test 
decision, the burden remains on readers to assess whether the sample 
evidence against the null (the magnitude of the test statistic) is strong 
enough to induce a change in their beliefs. 

A couple of lessons for writers emerge: 

* It is usually bad practice to say "significant [or nonsignificant] at the 
5% level,” without reporting the magnitude of the test statistic. (It is 
even worse practice to announce "significance" or "nonsignificance" 
without specifying a null hypothesis. In particular, the zero null may 
not be the interesting null.) 

* A useful alternative to the test statistic is a report of its "P-value,"
or "marginal significance level," which is the level at which the observed
test statistic would be just significant. For example, suppose that a $\chi^2(p)$
test is conducted, the cdf being $G_p(\cdot)$. If $w^{\circ}$ is the observed test statistic,
then its P-value is $\alpha^{\circ} = 1 - G_p(w^{\circ})$. The null would be rejected at all
significance levels higher than $\alpha^{\circ}$, and accepted at all significance levels
lower than $\alpha^{\circ}$. So the P-value gives readers more information than is
contained in the binary report "accept" or "reject."
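To make the P-value calculation concrete, here is a small sketch using only the Python standard library. It treats the statistic 1.82 from the example above as N(0, 1) under the null. For the chi-square case it uses the fact that with p = 2 the cdf is 1 − exp(−w/2); the observed value 5.99 is an illustrative assumption. The variable names are hypothetical.

```python
import math
from statistics import NormalDist

# Two-sided P-value for an observed N(0, 1) statistic of 1.82.
u_obs = 1.82
p_value_normal = 2 * (1 - NormalDist().cdf(abs(u_obs)))

# P-value for a chi-square(2) statistic: 1 - G_2(w) = exp(-w / 2).
w_obs = 5.99
p_value_chi2 = math.exp(-w_obs / 2)

print(round(p_value_normal, 4), round(p_value_chi2, 4))
```

The first P-value falls between 0.05 and 0.10, which is exactly the "zone of opportunity": the reader sees at a glance that the evidence rejects at the 10% level but not at the 5% level.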


22.5. Statistical versus Economic Significance 


A strong case can be made that hypothesis testing is widely abused in 
empirical economics: see McCloskey (1985). In many research reports, 
the author's conclusions emphasize the statistical significance, rather 
than the economic significance, of the coefficient estimates. Yet, a coef- 
ficient estimate may be "very significantly different from unity" (by the 
t-test), while that difference is economically trivial. Or the difference 
may be "not significantly different from unity" but have an economically 
substantial magnitude. 

It is certainly desirable to know how reliable a coefficient estimate is,
that is, to know its standard error. But that desirability does not suffice
to justify a hypothesis test, which involves measuring the estimate
relative to its standard error. Rather, the confidence interval for $\beta_j$,
constructed from the point estimate $b_j$ and its standard error $\hat\sigma_{b_j}$, will be the
proper target in most research.

When a null, say, $\beta_j = 1$, is specified, the likely intent is that $\beta_j$ is close
to 1, so close that for practical purposes it may be treated as if it were 1.
But whether 1.1 is "practically the same as" 1.0 is a matter of economics,
not of statistics. One cannot resolve the matter by relying on a hypothesis
test, because the test statistic $(b_j - 1)/\hat\sigma_{b_j}$ measures the estimated
coefficient in standard error units, which are not the meaningful units in
which to measure the economic parameter $\beta_j - 1$. It may be a good
idea to reserve the term "significance" for the statistical concept,
adopting "substantial" for the economic concept.
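A pair of made-up estimates illustrates the distinction. In this sketch (Python standard library only) the point estimates, standard errors, and the function name `summarize` are hypothetical, and the N(0, 1) critical value is used on the assumption that n − k is large.

```python
from statistics import NormalDist

c = NormalDist().inv_cdf(0.975)   # two-sided 5% critical value, about 1.96

def summarize(b, se, null=1.0):
    """Return the test statistic for the given null and the 95% interval."""
    u = (b - null) / se
    return u, (b - c * se, b + c * se)

# Precisely estimated coefficient very close to 1.
u_precise, ci_precise = summarize(b=1.01, se=0.002)
# Imprecisely estimated coefficient well away from 1.
u_imprecise, ci_imprecise = summarize(b=1.40, se=0.30)

print(round(u_precise, 2), [round(v, 3) for v in ci_precise])
print(round(u_imprecise, 2), [round(v, 3) for v in ci_imprecise])
```

The first estimate is "significantly different from unity" although its whole interval lies within about 0.015 of 1; the second is "not significant" although its interval admits values near 2. The intervals convey what the test decisions conceal.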

There is a further objection to the common practice of indiscriminately
reporting all the "t-statistics" for a regression: it encourages
rank-ordering of the explanatory variables with respect to their "importance."
What does it mean to say that in a multiple regression one explanatory
variable is "more important" than another?

A simple example may help to address this question. Suppose that
this estimated regression is reported:

$$\hat{y} = 50 + 2x_2 - 1x_3.$$

A naive reader might conclude that $x_2$ is "more important" than $x_3$
because its coefficient is larger in magnitude. A more sophisticated
reader would recognize that the magnitude of the coefficients can be
changed arbitrarily by changing the units in which the variables are
measured. So he might ask for the standard errors. Being told that the
standard errors for $b_2$ and $b_3$ are both 0.5, so their "t-statistics" are 4
and −2, he might conclude that $x_2$ is "more important" than $x_3$ because
its "t-statistic" is larger in magnitude. But that conclusion is not sensible
if in fact the variables are $y$ = weight (in pounds), $x_2$ = height (in inches),
$x_3$ = exercise (in hours per week), and the regression is to be used by a
physician to advise an overweight patient. Would either the physician
or the patient be edified to learn that height is "more important" than
exercise in explaining variation in weight?

The moral of this example is that statistical measures of "importance" 
are a diversion from the proper target of the research—estimation of 
relevant parameters—to the task of "explaining variation” in the depen- 
dent variable. 


22.6. Using Asymptotics


In the CNR model, provided that $n - k$ is large, there is no need to
refer to the t- and F-tables when $\sigma^2$ is unknown. Recall the two
asymptotic results shown in Section 18.3:


(1) If $u \sim t(n)$, then $u \xrightarrow{D} N(0, 1)$.
(2) If $v \sim F(m, n)$, then $mv \xrightarrow{D} \chi^2(m)$.


Applied to the CNR model, (1) implies that there is no objection, when
$n - k$ is large, to treating

$$u_j = (b_j - \beta_j)/\hat\sigma_{b_j}$$

as if it were

$$z_j = (b_j - \beta_j)/\sigma_{b_j}.$$
In the same manner, (2) implies that there is no objection, when $n - k$
is large, to treating

$$\hat{w} = pv = (t - \theta)' D (t - \theta)/\hat\sigma^2$$

as if it were

$$w = (t - \theta)' D (t - \theta)/\sigma^2.$$

For example, with $n - k = 200$ and $p = 2$, the exact 5% critical value
$d_p = 3.04$ from the F-table gives $p d_p = 6.08$ as the critical value for $\hat{w} =
pv$, while the approximation will use $c_p = 5.99$ from the chi-square table.

This simplification applies to hypothesis tests as well as to confidence 
region construction. 
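The numbers in the example can be reproduced with the Python standard library. In this sketch the F critical value 3.04 is taken from the text rather than computed. The chi-square critical value uses the closed form available when p = 2, where the cdf is 1 − exp(−w/2). The variable names are hypothetical.

```python
import math
from statistics import NormalDist

p = 2
d_p = 3.04                  # exact 5% critical value of F(2, 200), as quoted in the text
exact_cutoff = p * d_p      # critical value for the statistic p*v

# Chi-square(2): solving 1 - exp(-c/2) = 0.95 gives c = -2 ln(0.05).
c_p = -2 * math.log(0.05)

# Single-parameter analogue: N(0, 1) two-sided 5% critical value.
z_cutoff = NormalDist().inv_cdf(0.975)

print(round(exact_cutoff, 2), round(c_p, 2), round(z_cutoff, 2))
```

The gap between the exact cutoff and the chi-square approximation is small at n − k = 200, which is the sense in which the tables can be dispensed with.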


22.7. Inference without Normality Assumption 


From Chapter 19 on, the theory has relied on normality of $y$. In practice,
researchers routinely use the t- and F-procedures without making an
explicit normality assumption. A better practice might be to use the
normal and chi-square approximations of Section 22.6, for an asymptotic
theory appropriate to the CR model implies that $b$ is asymptotically
normal, that $u_j$ is asymptotically $N(0, 1)$, and that $\hat{w}$ is asymptotically
$\chi^2(p)$. Without normality, there is no presumption that the t- and
F-tables offer better approximations to the exact distributions of those
statistics even when the sample size is small.

To develop an asymptotic distribution theory that is appropriate to 
the CR model without normality, additional specification is needed. How 
does the X matrix develop as n increases? That is, how are the additional 
rows of X generated? In random sampling from a multivariate popu- 
lation, further specification is unnecessary, because random sampling 
extends itself automatically. But with our stratified sampling scheme, 
some additional assumptions are required. The most natural assump- 
tion is 

$$\lim(Q/n) = \Phi,$$

where $\Phi$ is positive definite. (Here lim is shorthand for limit as $n \to \infty$.)
To see the implications of this assumption, first rewrite $V(b)$ as

$$V(b) = \sigma^2 Q^{-1} = (\sigma^2/n)(Q/n)^{-1}.$$

If $\lim(Q/n) = \Phi$, with $\Phi$ positive definite (hence invertible), then
$\lim(Q/n)^{-1} = \Phi^{-1}$. Since $\lim(\sigma^2/n) = 0$, that would imply $\lim V(b) = O$.
Since $E(b) = \beta$ for every $n$, it would follow (by the multivariate version
of convergence in mean square) that $b \xrightarrow{P} \beta$, and $b$ would be a consistent
estimator of $\beta$.

Suppose further that the $\epsilon_i = y_i - x_i'\beta$ are independent and identically
distributed, which is stronger than the uncorrelated and identical
expectation and variance assumptions of the original CR model. Then
it can be shown (by a multivariate extension of the Central Limit
Theorem) that

$$b \overset{A}{\sim} N[\beta, (\sigma^2/n)\Phi^{-1}].$$

Similarly it can be shown that $\hat\sigma^2 \xrightarrow{P} \sigma^2$. The net result is that the
asymptotic approximations of Section 22.6 will apply even without
assuming normality for $y$. See Amemiya (1985, pp. 95-101), Judge et
al. (1988, pp. 264-270), or Greene (1990, pp. 312-318).
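A small Monte Carlo experiment can illustrate the claim. The sketch below (Python with NumPy) is a hypothetical design: the regressor values are held fixed across replications, the disturbances are centered exponential (skewed, hence non-normal), and the seed, sample size, and number of replications are arbitrary choices. It records how often the slope's studentized statistic exceeds 1.96 in absolute value when the null is true.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 200, 2000
x = rng.uniform(0.0, 10.0, size=n)            # held fixed across replications
X = np.column_stack([np.ones(n), x])
beta = np.array([1.0, 0.5])
XtX_inv = np.linalg.inv(X.T @ X)

rejections = 0
for _ in range(reps):
    eps = rng.exponential(1.0, size=n) - 1.0  # skewed, mean 0, variance 1
    y = X @ beta + eps
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    sigma2_hat = resid @ resid / (n - 2)
    u = (b[1] - beta[1]) / np.sqrt(sigma2_hat * XtX_inv[1, 1])
    rejections += int(abs(u) > 1.96)          # nominal 5% two-sided test

rejection_rate = rejections / reps
print(rejection_rate)
```

If the asymptotic approximation is working, the printed rejection rate should be close to the nominal 0.05, up to simulation noise.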

Henceforth when we report asymptotic properties in models with 
nonstochastic X, we shall be presuming that additional assumptions of 
the type introduced here are met. 


Exercises 


22.1 Suppose that the CNR model applies to $E(y) = x_1\beta_1 + x_2\beta_2 +
x_3\beta_3 + x_4\beta_4$. Let $z = x_3 + x_4$. For a sample of 124 observations, regressing
$y$ on $(x_1, x_2, x_3, x_4)$ gives 60 as the sum of squared residuals, while
regressing $y$ on $(x_1, x_2, z)$ gives 64 as the sum of squared residuals. Test
at the 5% significance level the null hypothesis $\beta_3 = \beta_4$ against the
two-sided alternative $\beta_3 \neq \beta_4$.


22.2 Suppose that the CNR model applies to $E(y) = x_1\beta_1 + x_2\beta_2 +
x_3\beta_3 + x_4\beta_4$, where $y$ = log output, $x_1 = 1$, $x_2$ = log capital, $x_3$ = log
land, and $x_4$ = log labor. Let $w = y - x_4$, $z_1 = x_1$, $z_2 = x_2 - x_4$, $z_3 = x_3 -
x_4$. For a sample of 104 firms, regressing $y$ on $(x_1, x_2, x_3, x_4)$ gives 70 as
the sum of squared residuals, while regressing $w$ on $(z_1, z_2, z_3)$ gives 80
as the sum of squared residuals.


(a) Test at the 5% significance level the null hypothesis $\beta_2 + \beta_3 +
\beta_4 = 1$ against the two-sided alternative $\beta_2 + \beta_3 + \beta_4 \neq 1$.
(b) Let $v = y - x_2$, $t_1 = x_1$, $t_2 = x_3 - x_2$, $t_3 = x_4 - x_2$. If $v$ is regressed
on $(t_1, t_2, t_3)$, what sum of squared residuals will be obtained?


22.3 The CNR model applies to $E(y) = x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + x_4\beta_4 +
x_5\beta_5$. A researcher regresses $y$ on $(x_1, x_2, x_3, x_4, x_5)$, and also regresses $w$
on $(z_1, z_2)$, where $w = y - x_4$, $z_1 = x_1$, $z_2 = x_2 + x_3$.


(a) State the joint null hypothesis that is testable by a comparison of 
the sum of squared residuals from those two regressions. 

(b) What is the "numerator degrees of freedom" parameter for that 
test? 


22.4 The regression slope $b$ in a CNR model is distributed $N(\beta, 1)$.
The null hypothesis $\beta = 0$ will be tested by using the statistic $z^{\circ} = b/\sigma_b$.


(a) For a conventional (two-sided alternative) situation, consider running
the test at the 10% and 5% levels. Write and run a program
that tabulates the power of the two tests at these nine values of
the true parameter $\beta$:


-2  -1.5  -1  -0.5  0  0.5  1  1.5  2.

(b) What does your table say about the effect of significance level on 
power? 

(c) Now consider a one-sided-alternative situation, the alternative
being $\beta > 0$. Specify an appropriate one-tailed procedure that
uses $z^{\circ}$ and operates at the 5% level. Tabulate the power of the
test at the nine values of $\beta$ given above.

(d) Comparing your results in (c) with those in (a), what do you 
conclude about the relative merits of one-tailed and two-tailed 
tests at the same significance level? 


22.5 For the earnings function estimated in Exercise 21.6, consider
$b_5$, the coefficient on race. Is it significantly different from zero? Is it
large, that is, substantial in economic terms?


23  Multicollinearity 


23.1. Introduction 


Multicollinearity, or simply collinearity, refers to correlation among the
explanatory variables in multiple regression. As in Section 19.4, let us
focus on the slope coefficients $\beta_j$ ($j = 2, \ldots, k$) in a CR model that
includes a constant as $x_1$. The estimated slopes are the $b_j$, whose
variances are

$$\sigma^2_{b_j} = \sigma^2/q^*_{jj} = \sigma^2/x_j^{*\prime}x_j^* = \sigma^2 \Big/ \Big[(1 - R_j^2)\sum_i (x_{ij} - \bar{x}_j)^2\Big],$$

where $R_j^2$ is the coefficient of determination in the auxiliary regression
of $x_j$ on all the other $x$'s. The condition that $X$ have full column rank
rules out exact collinearity: because rank($X$) = $k$, no $x_j$ can be an exact
linear function of the other $x$'s, so no $R_j^2$ will equal 1. But the rank
condition does not rule out high collinearity, that is, one or more $R_j^2$'s that are
close to 1. Indeed, many economic data sets do show high auxiliary
$R_j^2$'s, and virtually none show zero $R_j^2$'s. From the variance formula, we
see that ceteris paribus, a high auxiliary $R_j^2$ makes for a large $\sigma^2_{b_j}$. As
Judge et al. (1988, p. 882) write:


Multicollinearity is defined as the existence of one or more near- 
exact linear relations among the columns of the regressor matrix 
X. The consequences of multicollinearity are that the sampling 
distributions of the coefficient estimators may have such large
variances that the coefficient estimates are unstable from sample to
sample. Thus they may be too unreliable to be useful. 


When its variance is large, the estimator will be imprecise, the sample
value may well be far away from the true value, the confidence interval
for $\beta_j$ will be wide, very diverse hypotheses about $\beta_j$ will all be acceptable,
hypothesis tests on $\beta_j$ will have little power, and $b_j$ will not be significantly
different from "anything." In short, our best estimate of $\beta_j$ will not be
very good, and the sample will have told us little about the true value
of $\beta_j$.

All these unpleasant things are fully reflected in the standard error
of $b_j$, just as they would be if $R_j^2$ were zero while the variation of $x_j$,
namely $\sum_i (x_{ij} - \bar{x}_j)^2$, were small and/or the (conditional) variance of the
dependent variable, namely $\sigma^2$, were large. The LS estimate $b_j$ is still
the MVLUE, its standard error is still correct, and the conventional
confidence interval and hypothesis tests are still valid.
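The variance formula can be checked numerically. In this sketch (Python with NumPy) the data are simulated with two deliberately correlated regressors, and the seed, coefficients, and variable names are illustrative assumptions. The variance of a slope is computed once from σ²(X'X)⁻¹ and once from the auxiliary regression.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + 0.3 * rng.normal(size=n)   # highly correlated with x2
X = np.column_stack([np.ones(n), x2, x3])
sigma2 = 1.0                                 # assumed known for the illustration

# Variance of the slope on x2 from sigma^2 (X'X)^{-1}.
var_direct = sigma2 * np.linalg.inv(X.T @ X)[1, 1]

# The same variance from the auxiliary regression of x2 on the other x's.
others = np.column_stack([np.ones(n), x3])
coef, *_ = np.linalg.lstsq(others, x2, rcond=None)
resid = x2 - others @ coef
variation = np.sum((x2 - x2.mean()) ** 2)    # sum of squared deviations of x2
R2_aux = 1.0 - resid @ resid / variation
var_formula = sigma2 / ((1.0 - R2_aux) * variation)

print(round(R2_aux, 3), var_direct, var_formula)
```

The two variance figures should coincide. A high auxiliary R², small variation in the regressor, or a large σ² each enlarge the variance through the same formula.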

Nevertheless, in empirical research papers one comes across com- 
plaints such as "the standard errors are inflated because of collinearity," 
or "this variable is really significant but multicollinearity makes it look 
insignificant." 

To evaluate such complaints, consider a simpler situation: estimating
a univariate population mean when the sample size is small. Suppose
that a random variable $y$ has expectation $\mu$ and variance $\sigma^2$. In random
sampling, sample size $n$, the MVLUE of $\mu$ is the sample mean $\bar{y}$, whose
variance is $V(\bar{y}) = \sigma^2/n$. If $n$ is small, then ceteris paribus, $V(\bar{y})$ is large.
If $V(\bar{y})$ is large, then our estimator of $\mu$ is imprecise, the estimate $\bar{y}$ may
well be far away from the true $\mu$, the confidence interval for $\mu$ will be
wide, very diverse hypotheses about $\mu$ will all be acceptable, hypothesis
tests on $\mu$ will have little power, and $\bar{y}$ will not be significantly different
from "anything." In short, our best estimate of $\mu$ will not be very good,
and the sample will have told us little about the true value of $\mu$.

So the problem of multicollinearity when estimating a conditional 
expectation function in a multivariate population is quite parallel to the 
problem of small sample size when estimating the expectation of a 
univariate population. But researchers faced with the latter problem do 
not usually dramatize the situation, as some appear to do when faced 
with multicollinearity. 


23.2. Textbook Discussions 


It may be that econometrics textbooks contribute to the dramatization 
of multicollinearity by giving elaborate attention to the subject. Johnston 
(1984) writes: 
The prevalent case in so much econometric work, especially with
time series data, is one of high but not exact multicollinearity. This 
raises three questions: 1. What effects to expect from multicolli- 
nearity. 2. How to detect the degree of multicollinearity. 3. What 
remedial action to take. (p. 245) 


Among the effects to expect: 


A common result is to find regressions possibly with a very high
overall $R^2$, but with some (or many) individual coefficients apparently
insignificant. The high $R^2$ arises when the $y$ vector is close to
the hyperplane generated by the $x_j$ vectors and the apparently
insignificant coefficients arise because the $x_j$ vectors are nearly
linearly dependent. (pp. 248-249)


However: 


It is also possible to find a high $R^2$ and highly significant t values
on individual coefficients, even though multicollinearity is serious. 
This can arise if individual coefficients happen to be numerically 
well in excess of the true value, so that the effect still shows up in 
spite of the inflated standard error and/or because the true value 
itself is so large that even an estimate on the downside still shows 
up as significant. (p. 249) 


Among the detection devices is |X'X|. This determinant 


declines in value with increasing collinearity, tending to zero as 
collinearity becomes exact. While a useful warning signal, we have 
no calibration scale for assessing what is serious and what is very 
serious. (p. 249) 


As for remedies: 


More data is no help in multicollinearity if it is simply “more of the 
same.” What matters is the structure of the X'X matrix, and this 
will only be improved by adding data which are less collinear than 
before. However, there is often no easy way for an econometrician 
to get better data. The data are produced by the functioning of the 
economic system, and the collinearities reflect the nature of that 
system. (p. 250) 




Turning to another text, we find that Judge et al. (1988, chap. 21)
devote over twenty-five pages to multicollinearity. They point out that
coefficients may appear to be nonsignificantly different from zero, and
hence variables may be dropped from the regression, not because the
variables have no effect, but rather because the sample is inadequate to
estimate the effects precisely. This can happen even though the multiple
$R^2$ is high enough to indicate that the full regression has significant
explanatory power.

They argue that methods are required to detect the presence, severity, 
and form or nature of multicollinearity. They review some methods 
used to decide that the multicollinearity is severe: the simple correlation 
between a pair of explanatory variables exceeds 0.8 or 0.9, or the simple 
correlation exceeds the R? of the main regression. Such cutoff points 
are, they warn, arbitrary, and “pairwise correlations can give no insight 
into more complex interrelationships" when more than two explanatory 
variables are involved (p. 869). 

Other methods are discussed: the determinant of X'X, variance infla- 
tion factors, auxiliary regressions, Theil’s multicollinearity effect, and 
matrix decompositions. In the decomposition approach, relatively small 
characteristic roots of X'X indicate near-linear dependencies among the 
explanatory variables, and the associated characteristic vectors identify 
the dependencies themselves. They remark (p. 870) that "analysis of 
the characteristic roots and vectors of the X'X matrix can reveal much 
about the presence and nature of multicollinearity." They view the 
decomposition approach as the best of the available devices, but caution 
that it does not provide a complete solution: fixing a cutoff point for 
relative smallness is just a rule of thumb, and the method may fail to 
isolate multiple linear dependencies from one another. 
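The decomposition approach can be sketched as follows in Python with NumPy. The data are simulated with a built-in near-dependency, and the column scaling, seed, and variable names are assumptions for illustration. This is not the specific procedure of Judge et al., who may scale the columns differently. The smallest characteristic root of X'X flags the near-dependency, and its characteristic vector gives the weights of the offending linear combination.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = x2 + x3 + 0.01 * rng.normal(size=n)   # near-exact relation: x4 ≈ x2 + x3
X = np.column_stack([x2, x3, x4])

# Characteristic roots (ascending) and vectors of the symmetric matrix X'X.
roots, vectors = np.linalg.eigh(X.T @ X)
smallest_root = roots[0]
direction = vectors[:, 0]                   # weights of the near-dependency
root_ratio = roots[-1] / roots[0]           # largest root relative to smallest

print(roots)
print(direction)
print(root_ratio)
```

The vector attached to the smallest root should be roughly proportional to (1, 1, −1), recovering the relation x2 + x3 − x4 ≈ 0. How small a root must be to count as "relatively small" remains a rule of thumb, as noted above.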

Judge et al. go on to discuss several strategies for mitigating the effects 
of severe multicollinearity, while emphasizing that none of those strat- 
egies is completely safe. 


23.3. Micronumerosity


Econometrics texts devote many pages to the problem of multicollinearity
in multiple regression, but they say little about the closely
analogous problem of small sample size in estimating a univariate mean.
Perhaps that imbalance is attributable to the lack of an exotic polysyllabic
name for "small sample size." If so, we can remove that impediment by
introducing the term micronumerosity.

Suppose an econometrician set out to write a chapter about small 
sample size in sampling from a univariate population. Judging from 
what is now written about multicollinearity, the chapter might look like 
this: 


Micronumerosity 

The extreme case, "exact micronumerosity," arises when $n = 0$, in
which case the sample estimate of $\mu$ is not unique. (Technically,
there is a violation of the rank condition $n > 0$: the matrix 0 is
singular.) The extreme case is easy enough to recognize. "Near
micronumerosity" is more subtle, and yet very serious. It arises
when the rank condition $n > 0$ is barely satisfied. Near
micronumerosity is very prevalent in empirical economics.


Consequences of micronumerosity 

The consequences of micronumerosity are serious. Precision of 
estimation is reduced. There are two aspects of this reduction: 
estimates of $\mu$ may have large errors, and not only that, but $V(\bar{y})$ 
will be large. 

Investigators will sometimes be led to accept the hypothesis $\mu = 0$ 
because $u = \bar{y}/\hat\sigma_{\bar{y}}$ is small, even though the true situation may be 
not that $\mu = 0$ but simply that the sample data have not enabled 
us to pick $\mu$ up. 

The estimate of $\mu$ will be very sensitive to sample data, and the 
addition of a few more observations can sometimes produce drastic 
shifts in the sample mean. 

The true $\mu$ may be sufficiently large for the null hypothesis $\mu = 0$ 
to be rejected, even though $V(\bar{y}) = \sigma^2/n$ is large because of 
micronumerosity. But if the true $\mu$ is small (although nonzero) the 
hypothesis $\mu = 0$ may mistakenly be accepted. 


Testing for micronumerosity 
Tests for the presence of micronumerosity require the judicious 
use of various fingers. Some researchers prefer a single finger, 
others use their toes, still others let their thumbs rule. 

A generally reliable guide may be obtained by counting the 
number of observations. Most of the time in econometric analysis, 
when n is close to zero, it is also far from infinity. 

Several test procedures develop critical values n*, such that 
micronumerosity is a problem only if n is smaller than n*. But those 
procedures are questionable. 


Remedies for micronumerosity 
If micronumerosity proves serious in the sense that the estimate of 
$\mu$ has an unsatisfactorily low degree of precision, we are in the 
statistical position of not being able to make bricks without straw. 
The remedy lies essentially in the acquisition, if possible, of larger 
samples from the same population. 

But more data are no remedy for micronumerosity if the addi- 
tional data are simply “more of the same.” So obtaining lots of small 
samples from the same population will not help. 


If we return from this fantasy to reality, several lessons may be drawn. 

* Multicollinearity is no more (or less) serious than micronumerosity. 
Exact multicollinearity ($R_j^2 = 1$) is a close analogue of exact 
micronumerosity (n = 0). When a research article complains about 
multicollinearity, readers ought to see whether the complaints would be 
convincing if “micronumerosity” were substituted for “multicollinearity.” 

* For example, if a test for exact multicollinearity is reported, the null 
hypothesis being $R_j^2 = 1$, readers ought to consider whether they would 
test the null hypothesis n = 0. Or if a test for orthogonality is reported, 
the null hypothesis being $R_j^2 = 0$, readers ought to consider whether 
they would test the null hypothesis that n is large. It is quite sensible to 
measure n, but would one want to undertake a statistical test on the 
true value of n? 

* For another example, if a rule is proposed to decide whether the 
collinearity is severe (how large $R_j^2$ has to be before one says that there 
is a multicollinearity problem), readers ought to consider whether it is 
plausible to develop a rule that decides how small n has to be before 
one says that there is a small-sample-size problem. 


23.4. When Multicollinearity Is Desirable 


Multicollinearity may make the estimates of individual $\beta_j$'s imprecise, 
while facilitating the precise estimation of particular combinations of 
the elements of $\beta$. Suppose that we have estimated 

$E(y) = \beta_1 + \beta_2 x_2 + \beta_3 x_3$ 

by 

$\hat{y} = b_1 + b_2 x_2 + b_3 x_3$. 

Let $\theta = \beta_2 + \beta_3$, which is estimated by $t = b_2 + b_3$. The variances of 
the estimates are 

$\sigma^2_{b_2} = \sigma^2 q^{22}, \quad \sigma^2_{b_3} = \sigma^2 q^{33}, \quad \sigma^2_t = \sigma^2 (q^{22} + q^{33} + 2 q^{23})$, 

where the $q^{hj}$ are elements of $Q^{-1}$. Take the special case where $\sigma^2 = 1$, 
and the block of $Q^{-1}$ that refers to $x_2$ and $x_3$ is 

$\begin{pmatrix} q^{22} & q^{23} \\ q^{32} & q^{33} \end{pmatrix} = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}^{-1} = \frac{1}{1 - r^2} \begin{pmatrix} 1 & -r \\ -r & 1 \end{pmatrix}$. 

Here r is the sample correlation between $x_2$ and $x_3$. We have 

$\sigma^2_{b_2} = 1/(1 - r^2) = \sigma^2_{b_3}, \quad \sigma^2_t = 2/(1 + r)$. 

If r = 0, there is no collinearity, and 

$\sigma^2_{b_2} = \sigma^2_{b_3} = 1, \quad \sigma^2_t = 2$. 

But if r = 0.9, there is strong collinearity, and 

$\sigma^2_{b_2} = \sigma^2_{b_3} = 1/0.19 = 5.3, \quad \sigma^2_t = 2/1.9 = 1.05$. 

In this example, collinearity hinders precise inference about $\beta_2$ and 
$\beta_3$ separately, but facilitates precise inference about their sum $\theta = 
\beta_2 + \beta_3$. So if we happen to be interested in that particular $\theta$, then the 
high positive collinearity is desirable. For further discussion, see Conlisk 
(1971). 
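
The arithmetic of this example can be rechecked numerically. The sketch below is only an illustration: it assumes $\sigma^2 = 1$ and the 2 × 2 correlation block used above, and the function name `slope_variances` is hypothetical, not from the text.

```python
import numpy as np

def slope_variances(r, sigma2=1.0):
    """Variances of b2, b3 and t = b2 + b3 when the (x2, x3) block is [[1, r], [r, 1]]."""
    q_inv = np.linalg.inv(np.array([[1.0, r], [r, 1.0]]))
    var_b2 = sigma2 * q_inv[0, 0]
    var_b3 = sigma2 * q_inv[1, 1]
    # V(b2 + b3) = V(b2) + V(b3) + 2 C(b2, b3)
    var_t = sigma2 * (q_inv[0, 0] + q_inv[1, 1] + 2.0 * q_inv[0, 1])
    return float(var_b2), float(var_b3), float(var_t)

for r in (0.0, 0.9):
    print(r, [round(v, 2) for v in slope_variances(r)])
```

Raising r inflates the two individual variances while shrinking the variance of the sum, which is the point of the section.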


23.5. Remarks 


* In the CR model, all the consequences of multicollinearity are 
reflected in $V(b) = \sigma^2 Q^{-1}$, or in its unbiased estimator $\hat{V}(b) = \hat\sigma^2 Q^{-1}$. 
Researchers should not be concerned with whether or not “there really 
is collinearity.” They may well be concerned with whether the variances 
of the coefficient estimates are too large, for whatever reason, to 
provide useful estimates of the regression coefficients. 

* Multicollinearity is just one of the possible sources of high $\sigma^2_{b_j}$. For 
estimation of $\beta_j$, what is desirable per se is not low collinearity (small 
$R_j^2$) but rather low coefficient variance (small $\sigma^2_{b_j}$). 

* A sensible researcher may well want to calculate the auxiliary $R_j^2$'s, 
but it is unlikely that she will want to test hypotheses about their true 
magnitudes. 

* To say that “standard errors are inflated by multicollinearity" is to 
suggest that they are artificially, or spuriously, large. But in fact they 
are appropriately large: the coefficient estimates actually would vary a 
lot from sample to sample. This may be regrettable but it is not spurious. 

* To say that "the coefficient is really significant but multicollinearity 
makes it look insignificant" is to confuse statistical significance with 
economic significance: see Section 22.5. 


Exercises 


23.1 These results were found for LS regression of y = executive 
salaries on $x_1$ = sales and $x_2$ = profits, across a sample of 102 firms: 

$\hat{y} = 0.50 x_1 + 0.40 x_2$, $e'e = 250$, $X'X$ = (2 × 2 matrix, entries illegible in this copy). 
      (0.83)      (0.83) 

(All variables had been expressed as deviations about means for 
convenience.) Assume that the CNR model applies to the salary function 
$E(y) = \beta_1 x_1 + \beta_2 x_2$. Evidently, the high collinearity between sales and 
profits has prevented precise estimation of the parameters of the salary 
function. To eliminate this problem, it has been proposed that we 
proceed as follows. First, regress profits on sales, and obtain the 
residuals $x_2^*$. Second, regress y on $x_1$ and $x_2^*$ to estimate the parameters of 
the salary function. Denote the results of the second step by $\hat{y}^* = 
c_1 x_1 + c_2 x_2^*$. 


(a) Calculate $c_1$ and $c_2$, and calculate their standard errors. 

(b) Evaluate the proposal as a device for eliminating collinearity. 

(c) Evaluate the proposal as a device for obtaining more precise 
parameter estimates. 


23.2 The CR model applies and the X matrix shows high collinearity. 
The sample size is doubled by getting two observations on y, rather than 
one, at each of the rows of the original X. 

(a) What happens to the degree of collinearity? 

(b) What happens to the variance of the LS coefficients? 

(c) Comment on the claim that more data is no remedy for the 
multicollinearity problem if the data are simply "more of the 
same." 


23.3 Suppose that $\mu = E(y) = X\beta$, where $\mu$ and X are known and $\beta$ 
is unknown. 

(a) Under what condition on the rank of X is $\beta$ uniquely determined? 
(b) Comment on the relevance of this result to the multicollinearity 
problem. 


24.1. Introduction 


In empirical research, it is common practice to run several versions of 
a regression. We will explore some reasons for this practice and consider 
how the resulting estimates may be interpreted. 


24.2. Shortening a Regression 


Suppose that as a result of high collinearity (or for some other reason), 
the LS coefficient estimates are not precise enough to be useful. What 
should be done? The appropriate response will depend upon the 
objective of the research. If the objective were to “explain the variation in 
y," that is, to get a good fit, that is, to get a high $R^2$, then there would 
be no good reason to be concerned with the individual $b_j$'s. And so there 
would be no good reason to be bothered by large standard errors and 
"nonsignificance" of the individual coefficients. 

But suppose that the primary research objective is to learn about $\beta_1$ 
in the model 

$E(y) = X_1 \beta_1 + X_2 \beta_2$. 

For example, consider the household demand for butter, where y = 
expenditures on butter; the $k_1$ variables in $X_1$ include the constant, 
income, butter price, and margarine price, whose coefficients are of 
interest; and the $k_2$ variables in $X_2$ include family size, occupation, 
and location, which are included as "control" variables. 

We run the long regression $\hat{y} = X_1 b_1 + X_2 b_2$, intending to use $b_1$ as 
the estimator of $\beta_1$. If the CR model holds, then 

$E(b_1) = \beta_1, \quad V(b_1) = \sigma^2 Q^{11} = \sigma^2 (Q^*_{11})^{-1} = \sigma^2 (X_1' M_2 X_1)^{-1}$. 

The estimated variance matrix of $b_1$ is $\hat{V}(b_1) = \hat\sigma^2 Q^{11}$. Suppose that the 
diagonal elements of $\hat{V}(b_1)$ are so large that $b_1$ is not adequately 
informative about the parameter vector $\beta_1$. 

A natural response is to shorten the regression, that is, to run y on 
$X_1$ alone, reporting $b_1^*$ instead of $b_1$ as the estimate of $\beta_1$. To motivate 
that response, recall from Eqs. (17.7)-(17.8) that 

$E(b_1^*) = \beta_1 + F \beta_2$, 
$V(b_1^*) = \sigma^2 (X_1' X_1)^{-1}$, 

with $F = A_1 X_2$. As in Eq. (17.9), the variance comparison is clear-cut: 

$V(b_1) = V(b_1^*) + F V(b_2) F'$, 

where $F V(b_2) F'$ is nonnegative definite. So $V(b_1) \ge V(b_1^*)$, regardless of 
the value of $\beta_2$. The bias of $b_1^*$ as an estimator of $\beta_1$, namely $F \beta_2$, 
vanishes if $\beta_2 = 0$, that is, if the omitted explanatory variables are 
irrelevant. 
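
The variance comparison can be verified numerically for any particular X. The sketch below is an illustration under assumptions of its own: the data are simulated, $\sigma^2 = 1$, and $A_1 = (X_1'X_1)^{-1}X_1'$ is taken as the short-regression coefficient operator; none of the variable names come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k1, k2 = 50, 2, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, k2)) + 0.8 * X1[:, [1]]    # controls correlated with X1
X = np.hstack([X1, X2])
sigma2 = 1.0

Q_inv = np.linalg.inv(X.T @ X)
V_b1 = sigma2 * Q_inv[:k1, :k1]                     # long regression: sigma^2 Q^{11}
V_b2 = sigma2 * Q_inv[k1:, k1:]                     # sigma^2 Q^{22}
V_b1_short = sigma2 * np.linalg.inv(X1.T @ X1)      # short regression
F = np.linalg.solve(X1.T @ X1, X1.T @ X2)           # F = A1 X2

gap = V_b1 - V_b1_short
print(np.allclose(gap, F @ V_b2 @ F.T))             # V(b1) = V(b1*) + F V(b2) F'
print(np.linalg.eigvalsh(gap).min() >= -1e-12)      # the gap is nonnegative definite
```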

From this perspective, one can identify at least three distinct rationales 
for reporting $b_1^*$ rather than $b_1$, that is, for using the short, rather than 
the long, regression: 

(1) We believe that $\beta_2 = 0$. Excluding $X_2$, as the short regression 
does, introduces no bias, and does reduce the variance of the estimator 
of $\beta_1$. Indeed if $\beta_2 = 0$, then $b_1^*$ is the MVLUE of $\beta_1$, because the CR 
model will apply to $E(y) = X_1 \beta_1$. 

(2) We do not believe that $\beta_2 = 0$, but we have lowered our aspiration 
level. Rather than insisting on estimating $\beta_1$ and $\beta_2$ separately, we will 
be content with an estimate of $\beta_1^* = \beta_1 + F \beta_2$. Indeed $b_1^*$ is the MVLUE 
of that parameter combination. 

(3) We do not believe that $\beta_2 = 0$, nor will we be content with 
estimating $\beta_1^*$, but we have lowered our aspiration level in a different 
way. Rather than insisting on an unbiased estimator of $\beta_1$, we will be 
content with a biased estimator, provided that its bias is sufficiently 
offset by reduced variance. 

We focus on rationale (3) from the list above. The idea is that it is 
plausible to prefer a biased estimator to an unbiased one provided that 
the former's variance is sufficiently small. 

To assess the available trade-off between bias and variance, we 
generalize the mean squared error criterion introduced in Section 11.3. If 
a random vector t has expectation vector E(t) and variance matrix V(t), 
then, as an estimator of the parameter vector $\theta$, its mean squared error 
matrix is 

$S(t; \theta) = E[(t - \theta)(t - \theta)'] = V(t) + [E(t - \theta)][E(t - \theta)]'$. 

The minimum mean squared error (MSE) criterion for choosing an 
estimator of $\theta$ says that we should prefer $t_1$ to $t_2$ if $S(t_1; \theta) \le S(t_2; \theta)$ in 
the matrix sense, that is, if $S(t_2; \theta) - S(t_1; \theta)$ is nonnegative definite. 
For the short- and long-regression estimators of $\beta_1$, we have 

$S = S(b_1; \beta_1) = V(b_1)$, 
$S^* = S(b_1^*; \beta_1) = V(b_1^*) + F \beta_2 \beta_2' F'$. 

Subtracting gives 

$D = S - S^* = F V(b_2) F' - F \beta_2 \beta_2' F' = F [V(b_2) - \beta_2 \beta_2'] F'$. 

By the MSE criterion, $b_1^*$ is preferable to $b_1$ if the matrix D is 
nonnegative definite, a sufficient condition for which is that the matrix 
$[V(b_2) - \beta_2 \beta_2']$ is nonnegative definite. Heuristically, this condition says 
that the magnitude of $\beta_2$ is small relative to the variance matrix of its 
estimator $b_2$. 

Take the special case where $k_2 = 1$. Now $b_2$ and $\beta_2$ are scalars, and 
$V(b_2) - \beta_2 \beta_2' = \sigma^2_{b_2} - \beta_2^2$, so D is nonnegative definite if $\tau_2^2 = 
(\beta_2/\sigma_{b_2})^2 \le 1$. For this scalar case we have a clean conclusion: on the 
MSE criterion, prefer $b_1^*$ to $b_1$ iff $\tau_2^2 = (\beta_2/\sigma_{b_2})^2 \le 1$. This gives a 
particular precise meaning to the notion that the bias is small enough 
to be offset by reduced variance. Specialize further to the case where 
$k_1 = k_2 = 1$. Here $b_1$ is also a scalar, and 

$S = \sigma^2_{b_1} = \sigma^2/q^*_{11}$, 
$S^* = \sigma^2_{b_1^*} + F^2 \beta_2^2 = \sigma^2/q_{11} + (q_{12}/q_{11})^2 \beta_2^2$. 

As functions of $\beta_2^2$, S is constant while S* is linear in $\beta_2^2$. At $\beta_2 = 0$, 
$S^* \le S$ because $q^*_{11} \le q_{11}$. As $\beta_2$ departs from zero in either direction, S* 
increases, equaling S at $\beta_2 = \pm\sigma_{b_2}$ (i.e., at $\tau_2 = \pm 1$), and thereafter 
exceeding S. The short-regression estimator is preferable provided that 
$\beta_2$ is sufficiently close to zero. 


Example. Take $\sigma^2 = 1$, and $q_{11} = q_{22} = 1$, so $r = q_{12}$ lies between 
-1 and 1. Then $q^*_{11} = (1 - r^2)$, whence 

$S = 1/(1 - r^2), \quad S^* = 1 + r^2 \beta_2^2$. 

Figure 24.1 takes r = 0.5 and plots S and S* against $\beta_2^2$. The curve 
marked S** will be explained later. 
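
A few values of the two curves can be tabulated directly from the formulas of this example. The sketch assumes $\sigma^2 = 1$ and $q_{11} = q_{22} = 1$ as in the example; the function names are illustrative.

```python
def mse_long(r):
    # S = 1 / q*_11, with q*_11 = 1 - r^2 when sigma^2 = 1 and q_11 = q_22 = 1
    return 1.0 / (1.0 - r**2)

def mse_short(r, beta2_sq):
    # S* = 1 / q_11 + (q_12 / q_11)^2 * beta2^2, with q_11 = 1 and q_12 = r
    return 1.0 + r**2 * beta2_sq

r = 0.5
crossover = 1.0 / (1.0 - r**2)   # sigma^2_{b2}; S* equals S when beta2^2 reaches it
for beta2_sq in (0.0, 1.0, crossover, 4.0, 16.0):
    print(beta2_sq, mse_long(r), mse_short(r, beta2_sq))
```

With r = 0.5 the crossover falls at $\beta_2^2 = 4/3$, that is, at $\tau_2^2 = 1$.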


These special cases illustrate a tension that is almost inevitable in 
those empirical research situations in which the primary objective is to 
learn about a subset of the regression coefficients. The tension is 
between shortening and lengthening the regression, between 
"underspecifying" and "overspecifying" a regression function, between bias 
and variance. There is an incentive to exclude control variables to 
reduce variance, but doing so may introduce bias. There is an incentive 
to include control variables to avoid bias, but doing so may increase 
variance. The MSE criterion offers a particular evaluation of the 
available trade-off. 


Figure 24.1 Pretest estimation mean squared errors: S, S*, S** = MSE's 
for $b_1$, $b_1^*$, $b_1^{**}$. (Vertical axis: MSE; horizontal axis: $\beta_2^2$, from 0 to 16.) 


24.4. Pretest Estimation 


Continue with the $k_2 = 1$ case. If $\tau_2^2$ were known, the choice between 
$b_1^*$ and $b_1$ would, on the MSE criterion, be clear-cut. But with $\tau_2^2$ 
unknown in practice, how shall we implement the MSE criterion? It is 
natural to use the sample to learn about the value of $\tau_2^2$, and the natural 
estimator of 

$\tau_2^2 = (\beta_2/\sigma_{b_2})^2$ 

is 

$\hat\tau_2^2 = (b_2/\hat\sigma_{b_2})^2$. 

But this is precisely $v^2$, the F-statistic (squared t-statistic) that we would 
use to test the null hypothesis $\beta_2 = 0$ against the alternative $\beta_2 \ne 0$. 

If we were testing the null $\beta_2 = 0$, large values of $v^2$ would lead to its 
rejection, small values to its acceptance. In the present context, we do 
not wish to test $\beta_2 = 0$ (which is equivalent to $\tau_2^2 = 0$), but we will use 
the same statistic $v^2$, to serve as an indicator of whether $\tau_2^2 \le 1$. Evidently, 
small values of the statistic will favor small values of $\tau_2^2$. 

We have arrived at a particular regression strategy for estimation of $\beta_1$, 
namely pretest estimation. Generalizing to the case where $k_2 > 1$, we may 
spell out the procedure as follows. 


(i) Choose some cutoff value d. 

(ii) Run the long regression of y on $(X_1, X_2)$, obtaining $b_1$, $b_2$, 
and $\hat\sigma^2$. 

(iii) Calculate $v^2 = b_2' Q^*_{22} b_2/(k_2 \hat\sigma^2)$. 

(iv) If $v^2 > d$, then report $b_1$ as the estimate of $\beta_1$. 

(v) If $v^2 \le d$, then run the short regression of y on $X_1$, and report 
its $b_1^*$ as the estimate of $\beta_1$. 
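
Steps (i) to (v) can be sketched in code. This is a minimal illustration rather than a definitive implementation: the function name `pretest_estimate` and its return convention are hypothetical, the cutoff d is supplied by the caller, and $Q^*_{22}$ is taken to be $X_2' M_1 X_2$, computed from the residuals of $X_2$ regressed on $X_1$.

```python
import numpy as np

def pretest_estimate(y, X1, X2, d):
    """Return (estimate of beta_1, v2, used_short) following steps (i)-(v)."""
    n, k1 = X1.shape
    k2 = X2.shape[1]
    X = np.hstack([X1, X2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)            # (ii) long regression
    b1, b2 = b[:k1], b[k1:]
    e = y - X @ b
    sigma2_hat = e @ e / (n - k1 - k2)
    # Q*_22 = X2' M1 X2, built from the residuals of X2 regressed on X1
    X2_star = X2 - X1 @ np.linalg.lstsq(X1, X2, rcond=None)[0]
    v2 = b2 @ (X2_star.T @ X2_star) @ b2 / (k2 * sigma2_hat)   # (iii)
    if v2 > d:                                           # (iv) keep the long regression
        return b1, v2, False
    b1_short, *_ = np.linalg.lstsq(X1, y, rcond=None)    # (v) short regression
    return b1_short, v2, True
```

Calling it with a very large d always returns the short-regression estimate, and with a negative d always the long-regression one, which makes the two limiting strategies easy to compare on the same data.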

The cutoff value d may be the critical value associated with some 
significance level in the F-distribution, although we are not testing the null 
$\beta_2 = 0$. 

In any sample, either $b_1$ or $b_1^*$ is selected as the estimate. Formally, the 
pretest estimator, say $b_1^{**}$, may be written as a weighted average of the 
short- and long-regression estimators. Let z = 1 if $v^2 \le d$, and z = 0 if 
$v^2 > d$. Then the pretest estimator is 

$b_1^{**} = (1 - z) b_1 + z b_1^*$. 

As a guide to thinking about the distribution of $b_1^{**}$, consider two 
examples, drawn from outside the regression context, that illustrate 
how selection affects the distribution of sample statistics. 


Example. Suppose that X and Y are independent Bernoulli 
variables, each having parameter p. Let Z = max(X, Y). Then 

$\Pr(Z = 1) = \Pr(X = 1, Y = 1) + \Pr(X = 1, Y = 0) + \Pr(X = 0, Y = 1)$ 
$= p^2 + p(1 - p) + (1 - p)p$ 
$= p + p(1 - p)$. 

So Z is a Bernoulli variable with parameter $p^* = p + p(1 - p)$. Then 
$E(Z) = p^* > p = E(X) = E(Y)$. For example, if p = 0.05, then E(Z) = 
0.0975. 


Example. Suppose that X and Y are independent standard 
normal variables. Let Z = max(X, Y). Let f(·) and F(·) denote the 
standard normal pdf and cdf. Then the cdf of Z is 

$G(z) = \Pr(Z \le z) = \Pr(X \le z \cap Y \le z) = \Pr(X \le z)\Pr(Y \le z) = F^2(z)$. 

Clearly, the probability that Z exceeds some value c, namely $1 - G(c) = 
1 - F^2(c)$, is greater than the probability $1 - F(c)$ that X (or Y) exceeds 
that value. The pdf of Z, namely 

$g(z) = \partial G(z)/\partial z = 2 F(z) f(z)$, 

is plotted in Figure 24.2. Observe how the selection shifts the 
distribution (and hence the expectation) to the right. 
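
Both examples can be reproduced with the standard library alone. The sketch below is illustrative: the helper names are hypothetical, and the cutoff c = 1.645 is an arbitrary choice made here, not a value from the text.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_cdf(z):
    # G(z) = F(z)^2 for the maximum of two independent N(0, 1) variables
    return norm_cdf(z) ** 2

def max_pdf(z):
    # g(z) = 2 F(z) f(z)
    return 2.0 * norm_cdf(z) * norm_pdf(z)

c = 1.645
print(1.0 - norm_cdf(c), 1.0 - max_cdf(c))   # tail probability for X alone, then for Z

# Bernoulli analogue from the first example: p* = p + p(1 - p)
p = 0.05
print(p + p * (1.0 - p))
```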


Figure 24.2 Pdf of maximum of two independent N(0, 1) variables. 
(Horizontal axis from -3 to 3.) 


Returning to the regression context, we recognize that, because of 
the selection, the distribution theory for the pretest estimator is more 
complicated than that for either $b_1$ or $b_1^*$. For an introduction to the 
theory, see Wallace and Ashar (1972) and Judge et al. (1988, pp. 832-835). 


Example. Resume the example of Section 24.3, with $\sigma^2 = 1$, 
$q_{11} = q_{22} = 1$, $q_{12} = r = 0.5$. Suppose that the CNR model applies, and 
for convenience suppose that $\sigma^2$ is known, so that the chi-square statistic 
$w = (b_2/\sigma_{b_2})^2$ may be used instead of the F-statistic $v^2$. Then S**, the 
MSE for this ideal version of the pretest estimator $b_1^{**}$, can be calculated 
fairly readily from the properties of the bivariate normal distribution: 
see Exercise 24.1. In Figure 24.1, the curve S** is drawn for d = 3.84, 
which corresponds to a nominal 5% significance level test of $\beta_2 = 0$. As 
this example indicates, there is no range of $\beta_2^2$ over which the pretest 
estimator dominates the other two estimators. So we get no clear-cut 
guidance about the attractiveness of pretest estimation. 
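
The three MSE curves of this example can also be approximated by simulation rather than by the analytical route. The sketch below is a Monte Carlo illustration under stated assumptions: $(b_1, b_2)$ is drawn directly from its bivariate normal distribution with variance matrix $Q^{-1}$, $\beta_1$ is set to zero (the MSEs do not depend on it), and the identity $b_1^* = b_1 + F b_2$ with $F = q_{12}/q_{11} = r$ is used. The function name and defaults are hypothetical.

```python
import numpy as np

def simulate_mse(beta2, r=0.5, d=3.84, reps=200_000, seed=0):
    """Monte Carlo MSEs (S, S*, S**) of b1, b1*, b1** when sigma^2 = 1 is known."""
    rng = np.random.default_rng(seed)
    Q_inv = np.linalg.inv(np.array([[1.0, r], [r, 1.0]]))
    beta = np.array([0.0, beta2])
    b = rng.multivariate_normal(beta, Q_inv, size=reps)   # (b1, b2) ~ N(beta, Q^{-1})
    b1, b2 = b[:, 0], b[:, 1]
    b1_short = b1 + r * b2                 # b1* = b1 + F b2
    w = b2**2 / Q_inv[1, 1]                # chi-square statistic (b2 / sigma_b2)^2
    b1_pretest = np.where(w <= d, b1_short, b1)

    def mse(est):
        return float(np.mean((est - beta[0]) ** 2))

    return mse(b1), mse(b1_short), mse(b1_pretest)

for beta2_sq in (0.0, 1.0, 4.0, 9.0, 16.0):
    print(beta2_sq, simulate_mse(beta2_sq ** 0.5))
```

Printing the three values over a grid of $\beta_2^2$ gives a rough numerical counterpart to Figure 24.1.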


Explicitly or implicitly, the pretest strategy is followed by many 
empirical researchers. In journal articles, you will often find that the author 
has "experimented" with some alternative specifications before arriving 
at a final, preferred, regression. The experimentation is often of the 
type considered here, that is, shortening a regression when some coef- 
ficients in the long regression are "not significantly different from zero." 
Other restrictions on a regression are sometimes employed, for 
example, constant returns to scale. The motivation is the same: 
restricted estimates may be biased, but have smaller variance. Indeed, 
as shown in Section 22.2, any set of linear restrictions can be translated 
into a zero subvector form, so the analysis applies directly. 

The usual computer output (conventional standard errors calculated 
for the selected regression) will not do justice to the pretest strategy. In 
any sample, the strategy will select either $b_1$ or $b_1^*$, and the conventional 
standard errors (from the long or short regression respectively) simply 
do not take account of the stochastic nature of that selection. Readers 
should at least be aware of the exploratory process that led the author 
to the final, selected, regression. 

As a writer, it is a good idea to put yourself in the position of a 
prospective reader: provide the information that you would want to 
have if you were the reader. For some suggestions, see Leamer (1983). 


24.5. Regression Fishing 


There is a popular style of regression analysis that, though reminiscent 
of pretest estimation, is quite distinct in character. Here again the data 
consist of y, $X_1$, $X_2$, but now the researcher has no particular interest 
in either $\beta_1$ or $\beta_2$. Instead he wants to "explain the variation in the 
dependent variable" using only a few explanatory variables. In starkest 
form, the procedure is: 


(i) Run y on $X_1$, obtain $b_1^*$, $\hat{V}(b_1^*)$, and $R^2(1)$. 
(ii) Run y on $X_2$, obtain $b_2^*$, $\hat{V}(b_2^*)$, and $R^2(2)$. 
(iii) If $R^2(1) > R^2(2)$, then report the first short regression. 
If $R^2(1) \le R^2(2)$, then report the second short regression. 


(For convenience, we have allowed the summer vector to appear in both 
$X_1$ and $X_2$.) Alternatively described: run the long regression, test $\beta_1 = 
0$, also test $\beta_2 = 0$, then report the “more significant" of the two short 
regressions. 

Naturally, the coefficients in the reported regression will tend to be 
statistically significant when assessed by conventional standards. But 
those standards are clearly inappropriate. To report and interpret a 
selected model as if it were an unselected model is incorrect, as the 
examples in Section 24.4 illustrate. More convincing, perhaps, is an 
example drawn from Lovell (1983). Suppose that you have a set of k 
null hypotheses, each of which is tested at the significance level $\alpha$. 
Suppose that all the null hypotheses are true. What is the probability 
of getting at least one rejection? That is, what are the chances of getting 
at least one nominally significant result? If the test statistics are 
independent, then 

$\Pr(\text{at least one rejection}) = 1 - \Pr(\text{all accepted}) = 1 - (1 - \alpha)^k$. 

For example, with k = 10 and $\alpha$ = 0.10, the probability is $1 - (0.90)^{10}$ = 
0.65, which means that the actual significance level is 65% rather than 
the nominal 10%. Unless you want to test at the 65% level, you should 
not consider such an outcome to be statistically significant. It is hardly 
surprising, and perhaps not even interesting, to obtain a nominally 
significant outcome by fishing. 
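
The calculation is easy to repeat for other values of k and $\alpha$. The short sketch below assumes independent test statistics, as in the text; the function name is illustrative.

```python
def prob_at_least_one_rejection(k, alpha):
    """Chance of at least one rejection among k independent tests when every null is true."""
    return 1.0 - (1.0 - alpha) ** k

print(round(prob_at_least_one_rejection(10, 0.10), 2))   # 0.65, as in the text
for k in (1, 5, 20):
    print(k, prob_at_least_one_rejection(k, 0.05))
```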


Exercises 


24.1 Some of the flavor of the distribution theory for the pretest 
estimator can be captured in a simpler context. Suppose that the random 
variable $y \sim N(\mu, 1)$, and that we suspect that $\mu$ is near zero. A single 
observation will be drawn. These three estimators of $\mu$: 

$m = y$; \quad $m^* = 0$; \quad $m^{**} = y$ if $|y| > 1$, $m^{**} = 0$ if $|y| \le 1$, 

play the roles of $b_1$, $b_1^*$, and $b_1^{**}$ respectively. 

(a) Show that the expectation of m** is $E(m^{**}) = \pi_0 \mu + \pi_1$, where 

$\pi_0 = F(-\theta_1) + F(-\theta_2), \quad \pi_1 = f(\theta_2) - f(\theta_1)$, 
$\theta_1 = 1 + \mu, \quad \theta_2 = 1 - \mu$, 

and f(·) and F(·) denote the N(0, 1) pdf and cdf. Hints: (i) If 
$y \sim N(\mu, 1)$, then $t = y - \mu \sim N(0, 1)$; (ii) for the N(0, 1) 
pdf, $f(t) = (2\pi)^{-1/2} \exp(-t^2/2)$, the first derivative is $f'(t) = 
\partial f(t)/\partial t = -t f(t)$. 

(b) Show that the variance of m** is 

$V(m^{**}) = \pi_0 (1 - \pi_0) \mu^2 + (\pi_2 - \pi_1^2) + 2 \mu \pi_1 (1 - \pi_0)$, 

where 

$\pi_2 = \pi_0 - \mu \pi_1 + f(\theta_1) + f(\theta_2)$. 

Hint: For the N(0, 1) pdf, the second derivative is $f''(t) = 
\partial^2 f(t)/\partial t^2 = -f(t) + t^2 f(t) = (t^2 - 1) f(t)$. 

(c) Tabulate the MSE's of m, m*, and m** as functions of $\mu$. 
(d) Comment on the results. 


25 Regression with X Random 


25.1. Introduction 


We now drop the assumption that the explanatory variables are nonsto- 
chastic, and provide models that may be relevant in random sampling 
from a multivariate population. For least squares linear regression we 
report exact results when the population conditional expectation func- 
tion is linear, and then asymptotic results when the linearity assumption 
is absent. The analysis is a direct generalization of the analysis in Sections 
13.1 and 13.2, which referred to random sampling from a bivariate 
population. 


25.2. Neoclassical Regression Model 


Once again the data consist of an n × 1 vector y and an n × k matrix 
$X = (x_1, \ldots, x_k)$. The neoclassical regression, or NeoCR, model consists 
of these assumptions: 


(25.1) $E(y|X) = X\beta$, 
(25.2) $V(y|X) = \sigma^2 I$, 
(25.3) X stochastic, 
(25.4) rank(X) = k. 


Here “|X” means conditional on the matrix X. The most direct inter- 
pretation of these assumptions is that a CR model holds conditional on 
every value of X. 

To provide a framework for the NeoCR model, return to the 
population specification of Section 16.1. Suppose that there is a multivariate 
probability distribution for the random vector $(y, x_2, \ldots, x_k)'$, with pdf 
or pmf $f(y, x_2, \ldots, x_k)$. Expectations, variances, and covariances are 
defined in the usual manner: 

$E(y) = \mu_y, \quad V(y) = \sigma_y^2, \quad C(x_h, x_j) = \sigma_{hj}, \quad C(x_j, y) = \sigma_{jy}$, 

and so forth. Suppose further that the conditional expectation function 
of y given the x's is linear: 

$E(y|x_2, \ldots, x_k) = \beta_1 + \beta_2 x_2 + \cdots + \beta_k x_k$, 

and that the conditional variance function of y given the x's is constant: 

$V(y|x_2, \ldots, x_k) = \sigma^2$. 

We write these compactly as 

$E(y|x) = x'\beta, \quad V(y|x) = \sigma^2$, 

where $x = (x_1, \ldots, x_k)'$ with $x_1 = 1$, and $\beta = (\beta_1, \ldots, \beta_k)'$. 

Now we sample randomly from the multivariate population. That is, 
n independent drawings, $(y_1, x_1'), \ldots, (y_n, x_n')$, are made, giving the 
observed sample data (y, X). The rows of the observed data matrix, 
namely the $(y_i, x_i')$, are independent and identically distributed. 
(Caution: $x_i'$ denotes the ith row of X, not the transpose of the ith column 
of X.) In contrast to the CR model, in the NeoCR model the X matrix 
is random. 

Because the $y_i$'s are identically distributed, we have $E(y_i) = \mu_y$ and 
$V(y_i) = \sigma_y^2$ for all i. So for the n × 1 random vector y, we have $E(y) = 
s\mu_y$, where s is the n × 1 summer vector. This too contrasts with the 
CR model, where $E(y) = X\beta$ and the expectations of the $y_i$'s differ from 
one another. The conditional pdf g(y|x) is the same at all observations, 
so 

$E(y_i|x_i) = x_i'\beta, \quad V(y_i|x_i) = \sigma^2 \quad (i = 1, \ldots, n)$. 


Those are consequences of the fact that the $(y_i, x_i')$ are identically 
distributed. Now consider the consequences of the fact that they are 
independent. For specificity, consider the pdf of the first observation 
on y conditional on the first two observations on the k × 1 vector x: 

$g^*(y_1|x_1, x_2) = f^*(y_1, x_1, x_2)/h^*(x_1, x_2)$. 

Because $(y_1, x_1')$ is independent of $x_2$, we have 

$f^*(y_1, x_1, x_2) = f(y_1, x_1) h(x_2)$, 

and because $x_1$ and $x_2$ are independent and identically distributed, we 
have 

(25.5) $h^*(x_1, x_2) = h(x_1) h(x_2)$. 

So 

(25.6) $g^*(y_1|x_1, x_2) = f(y_1, x_1)/h(x_1) = g(y_1|x_1)$, 

which says that the distribution of $y_1$ conditional on $(x_1, x_2)$ is identical 
to the distribution of $y_1$ conditional on only $x_1$. By the same logic, 

(25.7) $g^*(y_2|x_1, x_2) = f(y_2, x_2)/h(x_2) = g(y_2|x_2)$. 

Proceed to the joint pdf of $y_1$, $y_2$ conditional on $x_1$ and $x_2$: 

$g^{**}(y_1, y_2|x_1, x_2) = f^{**}(y_1, x_1, y_2, x_2)/h^*(x_1, x_2)$. 

Because $(y_1, x_1')$ and $(y_2, x_2')$ are independent and identically distributed, 
we have 

$f^{**}(y_1, x_1, y_2, x_2) = f(y_1, x_1) f(y_2, x_2)$, 

and we also have Eq. (25.5). Consequently, 

(25.8) $g^{**}(y_1, y_2|x_1, x_2) = g(y_1|x_1) g(y_2|x_2)$ 
$= g^*(y_1|x_1, x_2) g^*(y_2|x_1, x_2)$, 

using Eqs. (25.6)-(25.7). This says that, conditional on $x_1$ and $x_2$, the 
joint pdf of $y_1$ and $y_2$ equals the product of their marginal pdf's. In 
other words, the variables $y_1$ and $y_2$ are independent conditional on $x_1$ 
and $x_2$, as well as unconditionally. 

When two distributions are the same, their expectations and variances 
are the same, so from Eq. (25.6) we conclude that 

$E(y_1|x_1, x_2) = E(y_1|x_1) = x_1'\beta$, 
$V(y_1|x_1, x_2) = V(y_1|x_1) = \sigma^2$. 

When two variables are independent, they are uncorrelated, so from 
Eq. (25.8) we conclude that 

$C(y_1, y_2|x_1, x_2) = 0$. 

The same conclusions follow when we condition on $x_3, \ldots, x_n$ along 
with $x_1$, $x_2$. And conditioning on all n rows $x_1', \ldots, x_n'$ is equivalent to 
conditioning on the matrix X. So 

$E(y_1|X) = E(y_1|x_1, x_2, \ldots, x_n) = E(y_1|x_1) = x_1'\beta$, 
$V(y_1|X) = V(y_1|x_1, x_2, \ldots, x_n) = V(y_1|x_1) = \sigma^2$, 
$C(y_1, y_2|X) = 0$. 

There is nothing special about the first two observations in this regard, 
not even their adjacency. So for i = 1, ..., n: 

$E(y_i|X) = E(y_i|x_1, x_2, \ldots, x_n) = E(y_i|x_i) = x_i'\beta$, 
$V(y_i|X) = V(y_i|x_1, x_2, \ldots, x_n) = V(y_i|x_i) = \sigma^2$, 
$C(y_i, y_h|X) = 0, \quad i \ne h$. 

Assembling these results for $y_1, \ldots, y_n$ into results for the vector y, we 
have 

$E(y|X) = X\beta, \quad V(y|X) = \sigma^2 I$, 


which are precisely the assumptions in Eqs. (25.1) and (25.2) of the 
NeoCR model. As for the rank condition, Eq. (25.4), there is a technical 
qualification. In random sampling there is always the possibility of 
drawing an X matrix that does not have full column rank: for example, 
all n of the $x_i$'s might turn out to be identical. Even if the population 
variance matrix of the x's is nonsingular, obtaining a short-ranked X 
has positive probability when the x's are all discrete. To dispose of that 
complication, adopt the convention that any sample with rank(X) < k 
is discarded. Then Eq. (25.4) applies. 

With that understanding, random sampling from the multivariate 
population specified here supports the NeoCR model. 

It is not the only scheme that would do so. Inspection of the argument 
above will show that to arrive at Eqs. (25.1)-(25.2), there is no need for 
the successive observations on the explanatory variables to be indepen- 
dently or even identically distributed: see Section 13.5 for discussion in 
the bivariate case. What is ruled out in the NeoCR model is the presence 
of the lagged dependent variable among the explanatory variables. For 
in that case, $y_1$ will be an element of $x_2$, so $E(y_1|x_1, x_2) = y_1 \ne E(y_1|x_1)$, 
whence $E(y|X) \ne X\beta$: see Section 26.5. It is best to think of $E(y_i|x_i) = 
x_i'\beta$ as a necessary, but not sufficient, condition for the NeoCR model 
to hold. 


25.3. Properties of Least Squares Estimation 


It is easy to assess the properties of the LS statistics b = Ay, e = My, 
$\hat\sigma^2 = e'e/(n - k)$, and $\hat{V}(b) = \hat\sigma^2 Q^{-1}$, in the NeoCR model. The matrices 
Q, A, and M, being functions of X, are now random, but conditional 
on X, they are constant. We calculate: 

$E(b|X) = E(Ay|X) = A E(y|X) = AX\beta = \beta$, 
$V(b|X) = V(Ay|X) = A V(y|X) A' = \sigma^2 AA' = \sigma^2 Q^{-1}$, 
$E(e|X) = E(My|X) = M E(y|X) = MX\beta = 0$, 
$V(e|X) = V(My|X) = M V(y|X) M' = \sigma^2 MM' = \sigma^2 M$. 

From these it follows that 

$E(e'e|X) = \sigma^2 \,\mathrm{tr}(M) = \sigma^2 (n - k)$, 
$E(\hat\sigma^2|X) = \sigma^2$, 
$E[\hat{V}(b)|X] = E(\hat\sigma^2 Q^{-1}|X) = E(\hat\sigma^2|X) Q^{-1} = \sigma^2 Q^{-1}$. 

So, conditional on any value of the matrix X, the LS statistics b, $\hat\sigma^2$, and 
$\hat{V}(b)$ remain unbiased. This is a direct consequence of the fact that the 
NeoCR model effectively assumes that a CR model holds conditional 
on every value of X. 
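
The matrix facts used in these calculations hold for every realized X, which can be spot-checked on one random draw. The sketch below assumes the usual definitions $Q = X'X$, $A = Q^{-1}X'$, and $M = I - XA$; the simulated X and the dictionary of checks are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # one random draw of X

Q = X.T @ X
A = np.linalg.solve(Q, X.T)     # A = Q^{-1} X'
M = np.eye(n) - X @ A           # M = I - X A

checks = {
    "AX = I": bool(np.allclose(A @ X, np.eye(k))),
    "AA' = Q^{-1}": bool(np.allclose(A @ A.T, np.linalg.inv(Q))),
    "MX = 0": bool(np.allclose(M @ X, 0.0)),
    "MM' = M": bool(np.allclose(M @ M.T, M)),
    "tr(M) = n - k": bool(np.isclose(np.trace(M), n - k)),
}
print(checks)
```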

Proceeding to unconditional moments, use the Law of Iterated 
Expectations (T8, Section 5.2) to calculate: 

$E(b) = E_X[E(b|X)] = E_X(\beta) = \beta$, 
$V(b) = E_X[V(b|X)] + V_X[E(b|X)] = E_X(\sigma^2 Q^{-1}) + O = \sigma^2 E(Q^{-1})$, 
$E(\hat\sigma^2) = E_X[E(\hat\sigma^2|X)] = E_X(\sigma^2) = \sigma^2$, 
$E[\hat{V}(b)] = E_X\{E[\hat{V}(b)|X]\} = E_X(\sigma^2 Q^{-1}) = \sigma^2 E(Q^{-1})$. 

We see that the LS statistics b, $\hat\sigma^2$, and $\hat{V}(b)$ are unbiased unconditionally 
as well. So the LS coefficients and their accompanying standard errors 
are still appropriate. As for optimality, a version of the Gauss-Markov 
Theorem applies in the NeoCR model: in the class of estimators that, 
conditional on every X, are linear and unbiased, the LS estimator has 
minimum variance. 

We see that LS estimation retains its attractiveness in the NeoCR 
model. Some results do differ. For example, in an analysis of the short 
regression, the matrices $F = A_1 X_2$ and $Q^*_{22} = X_2^{*\prime} X_2^*$ are now random, 
so that 

$E(b_1^*) = \beta_1 + E(F)\beta_2, \quad E(e^{*\prime} e^*) = \sigma^2 (n - k_1) + \beta_2' E(Q^*_{22}) \beta_2$. 


Nevertheless, the main conclusion of the analysis is that the key prop- 
erties of LS estimators carry over when X is allowed to be random. 
Nothing in the randomness of the explanatory variables per se creates 
an objection to LS estimation. 


25.4. Neoclassical Normal Regression Model 


If we strengthen the NeoCR model by assuming that, conditional on X, 
the random vector y is multivariate normal, we obtain the neoclassical 
normal regression, or NeoCNR, model: 


$$y \mid X \sim N(X\beta, \sigma^2 I), \qquad X \text{ stochastic}, \qquad \operatorname{rank}(X) = k.$$


The framework for this is random sampling from a multivariate
population in which the conditional distribution of y given the x's is
$y \mid x \sim N(x'\beta, \sigma^2)$.

All the distribution results in the CNR model now hold, conditional
on X. For example, let $b_j$ be an element of b, and let $q^{jj}$ be the
jth diagonal element of $Q^{-1}$. Then

$$b_j \mid X \sim N(\beta_j, \sigma^2 q^{jj}).$$

Observe that the conditional distribution of $b_j$ does depend on X via
$q^{jj}$: the conditional distributions are all normal with the same
expectation but with different variances.

The marginal distribution of $b_j$, being a mixture of those different
conditional normals, will not be normal. Nevertheless, the confidence
interval and region constructions, and the hypothesis test procedures,
developed in Chapters 19-22, remain valid. To see why, let
$z_j = (b_j - \beta_j)/(\sigma \sqrt{q^{jj}})$. Then $z_j \mid X \sim N(0, 1)$
for all X, so the marginal distribution of $z_j$ is also N(0, 1).
Because it is the $z_j$ variable that is used to develop the confidence
interval and hypothesis tests, those procedures remain valid. For
example, let $z_j^0 = (b_j - \beta_j^0)/(\sigma \sqrt{q^{jj}})$. Then, if the
null hypothesis $\beta_j = \beta_j^0$ is true,

$$\Pr[(z_j^0 > c) \mid X] = \Pr[(z_j > c) \mid X] = 1 - F(c),$$

where F(·) is the standard normal cdf. This probability does not vary
with X, so if the null is true, then $\Pr(z_j^0 > c) = 1 - F(c)$. The same
logic applies to the statistic $u_j = (b_j - \beta_j)/(\hat\sigma \sqrt{q^{jj}})$.
For, $u_j \mid X \sim t(n - k)$ for all X, which implies that
$u_j \sim t(n - k)$ unconditionally. And the same logic applies to the
chi-square and F statistics. Confidence levels and significance levels
are exactly as they were in the CNR model.

In summary, we have not been misled by concentrating attention 
heretofore on the stratified-on-x sampling scheme. Rather, we have 
merely avoided writing "|X" throughout. 


25.5. Asymptotic Properties of Least Squares Estimation 


In Chapter 18, for random sampling from a bivariate population, we 
reviewed asymptotic results for the sample linear projection slope that 
did not rely on normality, or on linearity of the CEF, or on homoske- 
dasticity. Those results generalize to cover LS estimation in random 
sampling from a multivariate population. 

It is convenient to revise our notation, isolating the constant from the
other explanatory variables. In the population, the k × 1 random vector
$z = (x', y)'$ has expectation vector and variance matrix:

$$\mu = E(z) = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix} = \begin{pmatrix} E(x) \\ E(y) \end{pmatrix},$$

$$\Sigma = V(z) = \begin{pmatrix} \Sigma_{xx} & \sigma_{xy} \\ \sigma_{xy}' & \sigma_{yy} \end{pmatrix} = \begin{pmatrix} V(x) & C(x, y) \\ C(y, x) & V(y) \end{pmatrix}.$$

Consider the best linear predictor of y given x in the population,
$E^*(y \mid x) = \alpha + x'\beta$. The equations determining its slope vector
(see Section 14.1) can be assembled into $C(x, u) = 0$, where
$u = y - (\alpha + x'\beta)$. That is,

$$C(x, y) = C(x, x'\beta) = C(x, x)\beta = V(x)\beta,$$

or $\Sigma_{xx}\beta = \sigma_{xy}$. Provided that $\Sigma_{xx} = V(x)$ is nonsingular,
the population BLP slope vector is

$$\beta = (\Sigma_{xx})^{-1} \sigma_{xy}.$$


Now turn to the sample. Let $X_2$ be the n × (k − 1) matrix of
observations on the nonconstant explanatory variables, $x_1$ be the
n × 1 summer vector, and y be the n × 1 vector of observations on the
dependent variable. From the discussion of deviations from means in
residual regression theory (Section 17.4), the sample LS slope vector
can be written as

$$b = (X_2^{*\prime} X_2^*)^{-1} X_2^{*\prime} y^*,$$

where $X_2^* = M_1 X_2$, $y^* = M_1 y$, $M_1 = I - (1/n)\,x_1 x_1'$, or for
that matter as

$$b = (X_2^{*\prime} X_2^*/n)^{-1} (X_2^{*\prime} y^*/n).$$

Now recognize that $X_2^{*\prime} X_2^*/n = S_{xx}$, the (k − 1) × (k − 1)
sample variance matrix of the x's, while $X_2^{*\prime} y^*/n = s_{xy}$, the
(k − 1) × 1 sample covariance vector of the x's with y. Thus the sample
LS slope vector is

$$b = (S_{xx})^{-1} s_{xy},$$

which is the obvious analog estimator of the population BLP slope vector

$$\beta = (\Sigma_{xx})^{-1} \sigma_{xy}.$$


In random sampling, sample moments converge in probability to the
corresponding population moments. So $S_{xx} \xrightarrow{P} \Sigma_{xx}$ and
$s_{xy} \xrightarrow{P} \sigma_{xy}$. By a multivariate version of S2
(Section 9.5) it follows that

$$b \xrightarrow{P} \beta,$$

so the LS slope vector b is a consistent estimator of the population slope
vector β. Similarly, the LS intercept $a = \bar y - \bar x' b$ is a consistent
estimator of the population BLP intercept
$\alpha = E(y) - [E(x)]'\beta = \mu_y - \mu_x'\beta$.
Proceeding, by using multivariate versions of the CLT and the Delta
method it can be shown that

$$b \overset{A}{\sim} N(\beta, \Phi/n),$$

with

$$\Phi = (\Sigma_{xx})^{-1} E(x^* x^{*\prime} u^2)\, (\Sigma_{xx})^{-1},$$

$$x^* = x - \mu_x, \qquad u = y - (\alpha + x'\beta).$$

This generalizes the result (Section 13.1) for the bivariate case, namely
$b \overset{A}{\sim} N(\beta, \phi^2/n)$ with $\phi^2 = E(x^{*2} u^2)/V^2(x)$.

This asymptotic theory for the sample LS coefficients holds with no
assumption on the form of the CEF. If the population CEF is linear,
then the LS estimators are unbiased as well as consistent. If also the
population conditional variance function is constant, then
$E(u^2 \mid x) = V(y \mid x) = \sigma^2$, and

$$E(x^* x^{*\prime} u^2) = E_x[E(x^* x^{*\prime} u^2 \mid x)] = E_x[x^* x^{*\prime} E(u^2 \mid x)]$$

$$= E(x^* x^{*\prime} \sigma^2) = \sigma^2 E(x^* x^{*\prime}) = \sigma^2 V(x) = \sigma^2 \Sigma_{xx}.$$

Then Φ will reduce to $\sigma^2 (\Sigma_{xx})^{-1}$, and the asymptotic
distribution will simplify to

$$b \overset{A}{\sim} N[\beta, (\sigma^2/n)(\Sigma_{xx})^{-1}].$$


The asymptotic results serve to justify, as approximations when random 
sampling from any multivariate population, the conventional normal- 
theory confidence regions, confidence intervals, and hypothesis tests. 

In practice, the elements of Φ will have to be estimated. We know
how to do this for the linear-homoskedastic case. For the general case,
$S_{xx}$ provides a consistent estimator of $\Sigma_{xx}$. Further, let
$x_i^{*\prime}$ denote the ith row of $X_2^*$, and $e_i$ denote the ith element
of the LS residual vector e. Then

$$(1/n) \sum_{i=1}^{n} x_i^* x_i^{*\prime} e_i^2$$

provides a consistent estimator of $E(x^* x^{*\prime} u^2)$. The square roots
of the diagonal elements of the resulting estimated Φ matrix, divided
by n, serve as standard errors of the LS coefficients when estimating a
population BLP, in the absence of assumptions of linearity of CEF and
homoskedasticity. They may be referred to as
"general-heteroskedasticity-corrected" standard errors.
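
For the bivariate case the calculation can be sketched in a few lines. This is an illustrative sketch only, assuming one nonconstant explanatory variable; the function name `slope_standard_errors` is hypothetical. It estimates the asymptotic variance of the slope by the sample analog of E(x*²u²)/V²(x), divided by n, and reports it next to the conventional standard error (which uses the residual sum of squares divided by n − 2).

```python
import math

def slope_standard_errors(x, y):
    """Return (slope, conventional SE, heteroskedasticity-corrected SE)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    xs = [xi - mx for xi in x]                      # deviations from the mean
    sxx = sum(v * v for v in xs) / n                # sample variance of x
    b = sum(v * (yi - my) for v, yi in zip(xs, y)) / n / sxx
    a = my - b * mx
    e = [yi - a - b * xi for xi, yi in zip(x, y)]   # LS residuals
    # Sample analog of E(x*^2 u^2), using residuals in place of u.
    middle = sum(v * v * ei * ei for v, ei in zip(xs, e)) / n
    corrected = math.sqrt(middle / (sxx * sxx) / n)
    # Conventional formula: sigma-hat^2 / sum of squared deviations.
    sigma2_hat = sum(ei * ei for ei in e) / (n - 2)
    conventional = math.sqrt(sigma2_hat / (n * sxx))
    return b, conventional, corrected

b, se_conv, se_corr = slope_standard_errors([-1.0, 0.0, 1.0], [0.0, 1.0, 0.0])
print(b, se_conv, se_corr)
```

The two standard errors generally differ in finite samples; they estimate the same quantity only when the CEF is linear and the conditional variance is constant.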
Exercises


25.1 Suppose that x and y are bivariate-normally distributed with
$E(y \mid x) = \alpha + \beta x$, $V(y \mid x) = \sigma^2$, and $V(x) = \sigma_x^2$. In
random sampling, sample size n from this population, let b be the sample
slope and let $s_x^2$ be the sample variance of x. Let

$$z = \sqrt{n}\,(b - \beta)\,s_x/\sigma, \qquad w = n s_x^2/\sigma_x^2, \qquad u = \sqrt{n - 1}\,(b - \beta)/(\sigma/\sigma_x).$$

(a) Show that $z \sim N(0, 1)$, that $w \sim \chi^2(n - 1)$, and that z and w are
independent.

(b) Show that $u \sim t(n - 1)$.

(c) Explain how the result in (b) completely specifies the marginal 
distribution of the sample slope in terms of parameters and 
sample size. 


26 Time Series


26.1. Departures from Random Sampling 


We digress from regression analysis in order to introduce some basic
ideas on time series. We deal with a single variable y, on which we have
a set of n observations $y_t$ for t = 1, ..., n. Here t indexes time,
measured discretely.

Figures 26.1, 26.2, and 26.3 display three sets of 100 observations on
a variable y, with t measured on the horizontal axis, and $y_t$ on the
vertical axis.

Figure 26.1 was produced as follows. For t = 0, 1, ..., 100,
observations $u_t$ were independently drawn from the N(0, 1) distribution.
Then, for t = 1, ..., 100, we set $y_t = u_t$. So the $y_t$ series is a
size-100 random sample from the N(0, 1) distribution. The joint
distribution of any adjacent pair of y's is SBVN(0), so
$E(y_t \mid y_{t-1}) = E(y_t) = 0$ regardless of the value of $y_{t-1}$: the
conditional expectation of $y_t$ given $y_{t-1}$ does not vary with
$y_{t-1}$. This lack of predictability is manifest in the jagged and
irregular time path of Figure 26.1.

Now, the typical economic time series does not look at all like that 
figure, but may well (perhaps after removing a linear time trend) display 
a relatively smooth and wavelike course such as that in Figures 26.2 and 
26.3. If so, it must be inappropriate to view such a series as a random 
sample from a univariate population. (The N(0, 1) population is being
used only for convenience.) If we plan to work with economic time 
series data, we need an observational scheme that departs from random 
sampling. There are two distinct types of departure from random sam- 
pling: the observations may not be independent, or they may not be 
identically distributed. 


Figure 26.2 was produced as follows. For t = 0, 1, ..., 100,
observations $u_t$ were independently drawn from a N(0, 1) distribution.
(In fact, the same numerical $u_t$ values were used as for Figure 26.1.)
Then we set $y_0 = u_0$ and generated the remaining $y_t$'s recursively as

$$y_t = \rho y_{t-1} + \sigma u_t \qquad (t = 1, \ldots, 100),$$

with $\rho = 0.9$ and $\sigma = \sqrt{1 - \rho^2}$.

Because $y_1 = \rho y_0 + \sigma u_1$ is linear in the two independent N(0, 1)
variables $y_0$ and $u_1$, with $\rho^2 + \sigma^2 = 1$, it follows that
$y_1 \sim N(0, 1)$. Then because $y_2 = \rho y_1 + \sigma u_2$ is linear in the two
independent N(0, 1) variables $y_1$ and $u_2$, we see that
$y_2 \sim N(0, 1)$. Proceeding in this manner we see that each $y_t$ is
distributed N(0, 1), so they are identically distributed. But they are
not independent. For example, the covariance between $y_1$ and $y_2$ is

$$C(y_1, y_2) = C(y_1, \rho y_1 + \sigma u_2) = \rho V(y_1) + \sigma C(y_1, u_2) = \rho,$$

and the covariance between $y_1$ and $y_3$ is

$$C(y_1, y_3) = C(y_1, \rho y_2 + \sigma u_3) = \rho C(y_1, y_2) + \sigma C(y_1, u_3) = \rho^2.$$

What we have is a set of random variables that are identically, but not
independently, distributed.

Focusing on an adjacent pair $(y_t, y_{t-1})$, we find that their joint
distribution is SBVN(ρ), because

$$\begin{pmatrix} y_{t-1} \\ y_t \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ \rho & \sigma \end{pmatrix} \begin{pmatrix} y_{t-1} \\ u_t \end{pmatrix},$$

where $y_{t-1}$ and $u_t$ are independent N(0, 1) variables. So, by bivariate
normal theory, $E(y_t \mid y_{t-1}) = \rho y_{t-1}$: the conditional expectation
of $y_t$ given $y_{t-1}$ does vary with $y_{t-1}$. This predictability of an
observation from its predecessor, manifest as a relatively smooth wave
in Figure 26.2, suffices to distinguish the series from a random sample.

Turning to Figure 26.3, it may come as a surprise to learn that the
observations plotted there were independently drawn. The figure was
produced as follows. For t = 0, ..., 100, the $u_t$ were independently
drawn from a N(0, 1) distribution. (In fact, the same $u_t$ values were
used as for Figures 26.1 and 26.2.) Then we set

$$y_t = \mu_t + \sigma u_t \qquad (t = 1, \ldots, 100),$$

where $\sigma = 1/3$ and $\mu_t = (4/3) \sin(7.2t)$, with the angle measured in
degrees. The $\mu_t$'s are nonstochastic, so $y_t \sim N(\mu_t, \sigma^2)$. The
$y_t$'s are independent because the $u_t$'s are independent. But their
expectations $\mu_t$ differ, so they are not identically distributed. What
we have is a set of random variables that are independently, but not
identically, distributed.

Focusing on an adjacent pair $(y_t, y_{t-1})$, we see that their joint
distribution is BVN($\mu_t$, $\mu_{t-1}$, $\sigma^2$, $\sigma^2$, 0) with
$\sigma^2 = 1/9$. So, by bivariate normal theory, $E(y_t \mid y_{t-1}) = \mu_t$:
the conditional expectation of $y_t$ does not vary with $y_{t-1}$. In that
sense an observation is not predictable from its predecessor. The
regularity in Figure 26.3 is attributable to the sine wave pattern in
the deterministic $\mu_t$ series, not to any dependence among the random
variables $y_t$.

A useful message emerges from the rough similarity of Figures 26.2
and 26.3: for a real-world economic time series, it may not be
self-evident which type of departure from random sampling is relevant.
Perhaps the observations are dependent, being autocorrelated. Perhaps
they are independent, with a changing expectation $\mu_t$. In the latter
case, if the expectation is expressible as a linear function of
observable explanatory variables, then a CR model might apply. Of
course, both departures may occur simultaneously.

Figure 26.3 Time series 3.
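
The three constructions described above can be reproduced approximately with a short program. This is a sketch under stated assumptions: the random draws come from Python's own generator with an arbitrary seed, so the numbers will not match those behind the book's figures, and the function names `make_series` and `lag1_autocorrelation` are hypothetical.

```python
import math
import random

def make_series(n=100, rho=0.9, seed=0):
    """Build the three series from one common set of N(0, 1) draws."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]   # u_0, ..., u_n
    sigma = math.sqrt(1.0 - rho ** 2)

    series1 = u[1:]                                   # y_t = u_t

    y = [u[0]]                                        # y_0 = u_0
    for t in range(1, n + 1):
        y.append(rho * y[t - 1] + sigma * u[t])       # dependent, identical
    series2 = y[1:]

    series3 = [(4.0 / 3.0) * math.sin(math.radians(7.2 * t)) + u[t] / 3.0
               for t in range(1, n + 1)]              # independent, not identical
    return series1, series2, series3

def lag1_autocorrelation(y):
    """Sample covariance of adjacent values divided by the sample variance."""
    n = len(y)
    m = sum(y) / n
    c0 = sum((v - m) ** 2 for v in y) / n
    c1 = sum((y[t] - m) * (y[t - 1] - m) for t in range(1, n)) / (n - 1)
    return c1 / c0

s1, s2, s3 = make_series()
for name, s in (("series 1", s1), ("series 2", s2), ("series 3", s3)):
    print(name, round(lag1_autocorrelation(s), 3))
```

A descriptive statistic such as the sample correlation between adjacent values tends to be large for both the second and third series, which is one way to see why the two departures from random sampling are hard to tell apart by inspection.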


26.2. Stationary Population Model 


Let $y = (y_1, \ldots, y_t, \ldots, y_n)'$ be an observed vector that displays
the characteristic regularity of economic time series. One setup that
produces such y's is a CR model, with $\mu = E(y) = X\beta$,
$\Sigma = V(y) = \sigma^2 I$, and the rows of X showing a regular development
over time.

Let us focus on the other departure from random sampling. We will
suppose that the y's have identical expectations and variances, but some
nonzero covariances. Then the elements of $\mu = E(y)$ are all the same,
and the diagonal elements of $\Sigma = V(y)$ are all the same, but at least
some off-diagonal elements of Σ are nonzero. In Section 28.3 we will
combine the two departures, allowing $\mu = X\beta$ as well as a nondiagonal
Σ matrix.

Here we confine attention to an important special case. Assume that
$E(y) = \boldsymbol{\mu} = x_1 \mu$, where $x_1$ is the n × 1 summer vector and μ
is a scalar, and that $V(y) = \Sigma$, with

$$\Sigma = \begin{pmatrix}
\gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{n-1} \\
\gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{n-2} \\
\gamma_2 & \gamma_1 & \gamma_0 & \cdots & \gamma_{n-3} \\
\vdots & \vdots & \vdots & & \vdots \\
\gamma_{n-1} & \gamma_{n-2} & \gamma_{n-3} & \cdots & \gamma_0
\end{pmatrix}.$$

We have introduced the notation

$$\gamma_j = C(y_t, y_{t-j}) \qquad (j = 0, \pm 1, \pm 2, \ldots, \pm(n - 1)).$$

Here $\gamma_j$ is the jth autocovariance, with $\gamma_0 = C(y_t, y_t) = V(y_t)$
being the variance. Further,

$$\rho_j = C(y_t, y_{t-j})/\sqrt{V(y_t)\,V(y_{t-j})} = \gamma_j/\gamma_0$$

is the jth autocorrelation, with $\rho_0 = 1$. Observe that
$\gamma_{-j} = \gamma_j$ because
$C(y_t, y_{t+j}) = C(y_{t+j}, y_t) = C(y_s, y_{s-j})$, with $s = t + j$. By the
same token, $\rho_{-j} = \rho_j$.

A distinctive feature of the Σ matrix above is that the covariance
between any two elements of y depends only on the absolute difference
between their subscripts, that is on the time distance between them. We
refer to this specification for μ and Σ as the stationary population, or
SP, model. The term "stationary," as used here, refers to constancy
across t. We have the expectations $E(y_t)$, the variances $V(y_t)$, and
the autocovariances $C(y_t, y_{t-j})$ being stationary, that is, constant
over t. So the autocorrelations are also stationary. Of course, the
autocovariances and autocorrelations need not be constant across j.

A stricter form of stationarity holds if the entire joint probability
distribution of any subset of the variables depends only upon the
differences in their time subscripts. Then, for example, the joint
distribution of $(y_1, y_5, y_7)$ is the same as that of
$(y_{22}, y_{26}, y_{28})$, and of course the marginal distribution of each
variable is the same. The terminology varies in the literature:
"stationary" may be reserved for the stricter form, with our weaker form
being referred to as weakly, or covariance, or second-order, or
wide-sense, stationary.


26.3. Conditional Expectation Functions 


At this point, we are prepared to interpret an n × 1 vector y, the
observed time series, as a single drawing from an n-variate population
with E(y) = μ and V(y) = Σ as above. It is natural to inquire about
CEF's in the population, in particular, about the conditional expectation
of y at time t given one or more past values of y. For the sake of
convenience, we will suppose that y is multinormally distributed so that 
all CEF's are linear. Normality is not crucial to the analysis; the gist of 
the results will apply to BLP's if the CEF's are not linear. 

As usual, the coefficients of a linear CEF are expressible in terms of 
population expectations, variances, and covariances. The general for- 
mulas of Section 14.1 and Section 25.5 specialize here because of sta- 
tionarity. 

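For example, consider the CEF of $y_t$ given $y_{t-1}$, namely
$E(y_t \mid y_{t-1}) = \alpha + \beta y_{t-1}$. Here $\beta = \gamma_1/\gamma_0 = \rho_1$
and $\alpha = (1 - \rho_1)\mu$. For a richer example, consider the CEF of
$y_t$ given $y_{t-1}$ and $y_{t-2}$:

$$E(y_t \mid y_{t-1}, y_{t-2}) = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2}.$$

Because of stationarity, we get $\beta_0 = (1 - \beta_1 - \beta_2)\mu$, and

$$\begin{pmatrix} \gamma_0 & \gamma_1 \\ \gamma_1 & \gamma_0 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \gamma_1 \\ \gamma_2 \end{pmatrix}.$$

Dividing through by $\gamma_0$ gives

$$\begin{pmatrix} 1 & \rho_1 \\ \rho_1 & 1 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = \begin{pmatrix} \rho_1 \\ \rho_2 \end{pmatrix},$$

the solution to which is

$$\beta_1 = \rho_1 (1 - \rho_2)/(1 - \rho_1^2), \qquad \beta_2 = (\rho_2 - \rho_1^2)/(1 - \rho_1^2).$$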


Observe that the autocorrelations suffice to determine the slopes and, 
along with the expectation, to determine the intercept. Observe also 
that the CEF's are themselves stationary: they do not change with t. 

Special cases of the SP model arise when specific assumptions are 
made about the pattern of the autocovariances across j. Such assump- 
tions have implications for the pattern of slopes in the CEF's. 


Example. Suppose that $\gamma_2 = 0$, so $\rho_2 = 0$. Then in
$E(y_t \mid y_{t-1})$, the slope is $\beta = \rho_1$, while in
$E(y_t \mid y_{t-1}, y_{t-2})$, the slopes are

$$\beta_1 = \rho_1/(1 - \rho_1^2), \qquad \beta_2 = -\rho_1^2/(1 - \rho_1^2) = -\rho_1 \beta_1.$$


Example. Suppose that $\gamma_2/\gamma_1 = \gamma_1/\gamma_0$, so $\rho_2 = \rho_1^2$.
Then in $E(y_t \mid y_{t-1})$, the slope is $\beta = \rho_1$, while in
$E(y_t \mid y_{t-1}, y_{t-2})$, the slopes are

$$\beta_1 = \rho_1, \qquad \beta_2 = 0.$$


In the second example, we see a coincidence of short and long regression
slopes that does not appear in the first example. This suggests that
one may be able to discriminate between alternative special cases of Σ
by examination of several CEF's.
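
The mapping from autocorrelations to CEF slopes is easy to evaluate numerically. The sketch below is illustrative only (the function names `cef_slopes_two_lags` and `cef_intercept` are hypothetical); it codes the two-lag solution given above and evaluates it for the two examples, with $\rho_1$ values chosen arbitrarily.

```python
def cef_slopes_two_lags(rho1, rho2):
    """Slopes of E(y_t | y_{t-1}, y_{t-2}) from the first two autocorrelations."""
    denom = 1.0 - rho1 ** 2
    beta1 = rho1 * (1.0 - rho2) / denom
    beta2 = (rho2 - rho1 ** 2) / denom
    return beta1, beta2

def cef_intercept(mu, beta1, beta2):
    """Intercept implied by stationarity: (1 - beta1 - beta2) * mu."""
    return (1.0 - beta1 - beta2) * mu

# First example: rho2 = 0 (rho1 = 0.4 is an arbitrary illustrative value).
b1, b2 = cef_slopes_two_lags(0.4, 0.0)
print(b1, b2, -0.4 * b1)

# Second example: rho2 = rho1 squared, so the second lag drops out.
c1, c2 = cef_slopes_two_lags(0.5, 0.25)
print(c1, c2)
```

In the first case the long-regression slope on $y_{t-1}$ differs from the short-regression slope $\rho_1$; in the second case the two coincide and the slope on $y_{t-2}$ is zero.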

In practice, the CEF's will be unknown, and it will be of interest to 
estimate them. Given a sample from the SP model, that is, an observed 
y-vector, we may consider estimating the parameters of a population 
CEF. The natural procedure is sample LS linear regression of y, on the 
corresponding set of past values. 

For the sake of concreteness, suppose that we wish to estimate
$E(y_t \mid y_{t-1}) = \alpha + \beta y_{t-1}$. With n observations in hand, the
usable data consist of the (n − 1) × 1 vectors $y = (y_2, \ldots, y_n)'$,
$x_1 = (1, \ldots, 1)'$, and $x_2 = (y_1, \ldots, y_{n-1})'$. Let
$X = (x_1, x_2)$ and $\boldsymbol{\beta} = (\alpha, \beta)'$. The LS coefficient
vector is $b = (X'X)^{-1} X'y$. To assess its sampling distribution,
observe first that the X matrix contains elements of the random vector
y. So X cannot be constant, which means that a CR model is not
applicable. Will a NeoCR model be applicable? With
$x_t = (1, x_{t2})' = (1, y_{t-1})'$, it is true that
$E(y_t \mid x_t) = \alpha + \beta y_{t-1} = x_t'\boldsymbol{\beta}$. However,
$x_{t+1} = (1, x_{t+1,2})' = (1, y_t)'$ contains the actual value of $y_t$,
so that

$$E(y_t \mid x_t, x_{t+1}) = E(y_t \mid y_{t-1}, y_t) \neq E(y_t \mid x_t) = x_t'\boldsymbol{\beta}.$$

Consequently, $E(y_t \mid X) \neq x_t'\boldsymbol{\beta}$, whence
$E(y \mid X) \neq X\boldsymbol{\beta}$.

Despite the linearity of the CEF, a strict requirement of the NeoCR
model is violated. Consequently, $E(b \mid X) = A\,E(y \mid X)$ does not reduce
to $AX\boldsymbol{\beta} = \boldsymbol{\beta}$, and hence there is no presumption
that $E(b) = E_X[E(b \mid X)] = \boldsymbol{\beta}$. Indeed b is a biased
estimator of $\boldsymbol{\beta}$, and no unbiased estimator of
$\boldsymbol{\beta}$ is available.

Nevertheless, under general conditions, b is a consistent estimator of
β, and in fact the asymptotic theory for LS estimation (Section 25.5)
applies. The conditions include a specification of how additional
observations are produced, that is how the μ vector and Σ matrix develop
as n increases.


26.4. Stationary Processes 


We seek an underlying framework that will support the SP model for 
an observed n X 1 vector y, and that will readily extend as n increases. 
Consider, then, an infinite sequence of random variables ordered in 
time: $y_t$ for t = ..., −2, −1, 0, 1, 2, .... The index t denotes time,
measured discretely. We refer to such an infinite sequence as a stochastic 
process. Each of the variables in the sequence has an expectation and 
variance, and each pair of them has a covariance. Now suppose that all 
the variables have the same expectation and variance, and further that 
the covariance between any pair of them depends only on the absolute 
difference between their subscripts (that is, on the length of time 
between them). We refer to such an infinite sequence as a stationary 
stochastic process. (Here again, the stricter concept of stationarity may 
arise, and the terminology varies.) Evidently, the SP model will apply to 
any n successive variables in such a sequence. 

How can such an infinite sequence of random variables be generated?
Here are two leading examples of mechanisms that produce stationary
stochastic processes.

First-Order Moving-Average Process


Suppose that for t = ..., −2, −1, 0, 1, 2, ..., the values of $y_t$ are
determined by

$$(26.1) \qquad y_t = \phi_0 + \phi u_{t-1} + u_t,$$

where the $u_t$'s are independent drawings on a random variable u with
$E(u) = 0$ and $V(u) = \sigma^2$. Because the $u_t$'s are independent and
identically distributed, it follows that the y's are identically
distributed, and indeed that the joint distribution of any pair of y's
depends only upon the difference in their time subscripts. Similarly for
any triplet, and so forth. This case is called the first-order
moving-average, or MA(1), process.

The first and second moments of an MA(1) process are readily
derived. From Eq. (26.1) and the assumptions on the $u_t$, it follows for
every t that:

$$(26.2a) \qquad \mu = \phi_0 + \phi E(u_{t-1}) + E(u_t) = \phi_0,$$

$$(26.2b) \qquad \gamma_0 = V(\phi u_{t-1} + u_t) = \phi^2 \sigma^2 + \sigma^2 = (1 + \phi^2)\sigma^2,$$

$$(26.2c) \qquad \gamma_1 = C(\phi u_{t-1} + u_t,\ \phi u_{t-2} + u_{t-1}) = \phi \sigma^2,$$

while

$$\gamma_2 = C(\phi u_{t-1} + u_t,\ \phi u_{t-3} + u_{t-2}) = 0,$$

and similarly $\gamma_3 = \gamma_4 = \cdots = 0$. So the autocorrelations of the
MA(1) process are $\rho_1 = \gamma_1/\gamma_0 = \phi/(1 + \phi^2)$, and
$\rho_j = 0$ for j = 2, 3, .... For any y containing n successive values
of $y_t$, the SP model applies: the Σ matrix has all elements zero except
along the main diagonal and along the strips just above and below that
main diagonal.
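
These moments can be checked by treating $y_t$ as a weighted sum of independent u's. The following is a minimal sketch (the function name `ma_autocovariance` and the numerical values of φ and σ² are illustrative assumptions): the covariance between two such sums is σ² times the sum of products of the weights attached to the same u.

```python
def ma_autocovariance(weights, sigma2, j):
    """jth autocovariance of y_t = const + sum_k weights[k] * u_{t-k}.

    Only u's that appear in both y_t and y_{t-j} contribute, so the result
    is sigma2 * sum_k weights[k] * weights[k + j].
    """
    j = abs(j)
    return sigma2 * sum(weights[k] * weights[k + j]
                        for k in range(len(weights) - j))

phi, sigma2 = 0.5, 2.0            # illustrative values, not from the text
weights = [1.0, phi]              # u_t has weight 1, u_{t-1} has weight phi

gamma = [ma_autocovariance(weights, sigma2, j) for j in range(4)]
rho1 = gamma[1] / gamma[0]
print(gamma, rho1)
```

For these values the calculation gives $\gamma_0 = 2.5$, $\gamma_1 = 1$, and zero autocovariances beyond the first, in agreement with Eqs. (26.2b-c).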


First-Order Autoregressive Process 


Suppose that for t = ..., −2, −1, 0, 1, 2, ..., the values of $y_t$ are
determined by

$$(26.3) \qquad y_t = \theta_0 + \theta y_{t-1} + u_t,$$

where the $u_t$'s are independent drawings on a random variable u with
$E(u) = 0$ and $V(u) = \sigma^2$, and $|\theta| < 1$. By repeated
back-substitution, we find

$$y_t = \theta_0 \sum_{i=0}^{\infty} \theta^i + \sum_{i=0}^{\infty} \theta^i u_{t-i} = \theta_0/(1 - \theta) + \sum_{i=0}^{\infty} \theta^i u_{t-i}.$$

Observe how the condition $|\theta| < 1$ was used to ensure convergence of
the infinite series. Because the u's are independent and identically
distributed, it follows that the y's are identically distributed, and
indeed that the joint distribution of any subset of y's will depend only
on the differences in their time subscripts. This case is called the
first-order autoregressive, or AR(1), process.

Taking that stationarity for granted, the first and second moments of
the AR(1) process are readily derived. Observe that $u_t$ is independent
of $y_{t-1}, y_{t-2}, \ldots$. From Eq. (26.3) and the assumptions on the
$u_t$, it follows for every t that:

$$(26.4a) \qquad \mu = \theta_0 + \theta E(y_{t-1}) + E(u_t) \;\Rightarrow\; \mu = \theta_0 + \theta\mu \;\Rightarrow\; \mu = \theta_0/(1 - \theta),$$

$$(26.4b) \qquad \gamma_0 = V(\theta y_{t-1} + u_t) = \theta^2 \gamma_0 + \sigma^2 \;\Rightarrow\; \gamma_0 = \sigma^2/(1 - \theta^2),$$

$$(26.4c) \qquad \gamma_1 = C(\theta y_{t-1} + u_t,\ y_{t-1}) = \theta V(y_{t-1}) = \theta \gamma_0,$$

while

$$\gamma_2 = C(\theta y_{t-1} + u_t,\ y_{t-2}) = \theta C(y_{t-1}, y_{t-2}) = \theta \gamma_1 = \theta^2 \gamma_0,$$

and similarly $\gamma_3 = \theta^3 \gamma_0$, $\gamma_4 = \theta^4 \gamma_0$, and so forth.
So the autocorrelations of the AR(1) process are $\rho_1 = \theta$,
$\rho_2 = \theta^2$, ..., $\rho_j = \theta^j$, .... For any y containing n
successive values of $y_t$, the SP model applies: the Σ matrix has all
elements nonzero, and declining in a particular way as we go away from
the main diagonal.
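
The back-substituted form offers an independent check on these moments. The sketch below is an illustration under stated assumptions: the infinite sum is truncated after a fixed number of terms, the function names are hypothetical, and the parameter values mirror the series of Figure 26.2 (θ = 0.9 and σ² = 1 − θ², so that the variance is one).

```python
def ar1_autocovariance_by_truncation(theta, sigma2, j, terms=200):
    """Approximate gamma_j from y_t - mu = sum_i theta**i * u_{t-i}.

    The u's shared by y_t and y_{t-j} carry weights theta**(i+j) and
    theta**i, so gamma_j = sigma2 * sum_i theta**i * theta**(i+j); the
    infinite sum is cut off after `terms` terms.
    """
    j = abs(j)
    return sigma2 * sum(theta ** i * theta ** (i + j) for i in range(terms))

def ar1_autocovariance_closed_form(theta, sigma2, j):
    """gamma_j = theta**|j| * sigma2 / (1 - theta**2), per Eqs. (26.4b-c)."""
    return theta ** abs(j) * sigma2 / (1.0 - theta ** 2)

theta = 0.9
sigma2 = 1.0 - theta ** 2
for j in range(4):
    approx = ar1_autocovariance_by_truncation(theta, sigma2, j)
    exact = ar1_autocovariance_closed_form(theta, sigma2, j)
    print(j, round(approx, 6), round(exact, 6))
```

The truncated sums agree with the closed-form values to many decimal places, and the autocovariances decline geometrically in j.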


In both of these leading examples, the y's are identically, but not
independently, distributed. In the MA(1) process, the dependence is
confined to adjacent y's, while in the AR(1) process it extends
indefinitely, although with $|\theta| < 1$, the correlations do taper off in
magnitude as the time distance between the variables increases.

More elaborate cases can be constructed, starting again with an infinite
sequence of independent and identically distributed u's with E(u) = 0.
Thus, there is the second-order moving-average, or MA(2), process:

$$y_t = \phi_0 + \phi_1 u_{t-1} + \phi_2 u_{t-2} + u_t,$$

and the second-order autoregressive, or AR(2), process:

$$y_t = \theta_0 + \theta_1 y_{t-1} + \theta_2 y_{t-2} + u_t,$$

with $\theta_1 + \theta_2 < 1$, $\theta_2 - \theta_1 < 1$, and $\theta_2 > -1$ to ensure
stationarity. Mixed cases may also be constructed. For example, the
ARMA(1, 2) process has

$$y_t = \theta_0 + \theta y_{t-1} + \phi_1 u_{t-1} + \phi_2 u_{t-2} + u_t,$$

with $|\theta| < 1$. These elaborations allow for more complex patterns of
autocorrelation, and hence more flexible patterns of CEF slopes, than
appeared in our leading examples.

A couple of remarks: 

* In our examples, the u's were taken to be independent and
identically distributed, so the stricter form of stationarity prevailed.
In fact, uncorrelated u's with constant expectation and variance would
suffice for most purposes.

* In many contexts, the assumption of an infinitely long past history
is unattractive. Stationarity for t = 1, 2, ..., can still be ensured by
appropriate choice of initial conditions. Thus an MA(1) process can be
started up with $u_0$, and an AR(1) process can be started up with

$$y_0 = \theta_0/(1 - \theta) + u_0/\sqrt{1 - \theta^2}.$$


26.5. Sampling and Estimation 


A stationary stochastic process may be characterized in terms of the
process parameters, which consist of the φ's and/or θ's and $\sigma^2$. It may
also be characterized in terms of the population moments, which consist
of μ, $\gamma_0, \gamma_1, \gamma_2, \ldots$ (or equivalently μ, $\gamma_0$,
$\rho_1, \rho_2, \ldots$). The CEF coefficients, for any set of conditioning
variables, can be derived from those. Either set of parameters may be
viewed as interesting features of the population. A representation in
terms of the process parameters will be more parsimonious, especially
when a low-order AR, MA, or ARMA specification applies.

Suppose that the parameters are unknown, but n consecutive
observations $y_1, \ldots, y_n$ generated by the process are available. We
view $y = (y_1, \ldots, y_n)'$ as a single drawing from a multivariate
population with E(y) = μ and V(y) = Σ as in the SP model. The analogy
principle suggests that we estimate the population moments by the
corresponding sample moments, and then convert those into estimates of
the process parameters. Define the sample mean, sample variance, and
first sample autocovariance by

$$(26.5a) \qquad m = \sum_{t=1}^{n} y_t / n,$$

$$(26.5b) \qquad c_0 = \sum_{t=1}^{n} (y_t - m)^2 / n,$$

$$(26.5c) \qquad c_1 = \sum_{t=2}^{n} (y_t - m)(y_{t-1} - m)/(n - 1).$$

Similarly, the second sample autocovariance is

$$c_2 = \sum_{t=3}^{n} (y_t - m)(y_{t-2} - m)/(n - 2),$$

and so forth. The sample autocorrelations are $r_j = c_j/c_0$. These sample
moments can serve as estimators of the population moments μ, $\gamma_0$,
$\gamma_j$, and $\rho_j$.

How are they converted into estimators of the process parameters?
There is an extensive literature on inferring the type of process by
inspection of the sample autocorrelations: see Judge et al. (1988,
pp. 684-705) for an introduction. But here we suppose that the type is
known, and to illustrate, confine attention to the first-order cases.

For the MA(1) case, the natural estimators of the process parameters
are the solutions to the sample counterparts of Eqs. (26.2a-c):

$$(26.6a) \qquad m = \hat\phi_0,$$

$$(26.6b) \qquad c_0 = (1 + \hat\phi^2)\hat\sigma^2,$$

$$(26.6c) \qquad c_1 = \hat\phi \hat\sigma^2.$$

More explicitly,

$$\hat\phi_0 = m, \qquad \hat\phi = [1 - \sqrt{1 - 4r_1^2}]/(2r_1), \qquad \hat\sigma^2 = c_1/\hat\phi.$$

(See Exercise 26.2 for an explanation of the choice of root in the
quadratic equation $r_1 = \phi/(1 + \phi^2)$.)

For the AR(1) case, the natural estimators of the process parameters
are the solutions to the sample counterparts of Eqs. (26.4a-c):

$$(26.7a) \qquad m = \hat\theta_0/(1 - \hat\theta),$$

$$(26.7b) \qquad c_0 = \hat\sigma^2/(1 - \hat\theta^2),$$

$$(26.7c) \qquad c_1 = \hat\theta c_0.$$

More explicitly,

$$\hat\theta = r_1, \qquad \hat\theta_0 = (1 - r_1)m, \qquad \hat\sigma^2 = (1 - r_1^2)c_0.$$

In both cases, the resulting estimators of process parameters will be
consistent (by S2, Section 9.5), if the sample moments m, $c_0$, and $c_1$
converge in probability to the corresponding population moments μ,
$\gamma_0$, and $\gamma_1$. This convergence will occur in general even though
the observations are not obtained by random sampling from the univariate
distribution of y.
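
The conversion from sample moments to process parameters can be sketched as follows. This is an illustration, not a definitive implementation: the function names are hypothetical, and the handling of $r_1 = 0$ and of $|r_1| > 1/2$ in the MA(1) case is an added assumption, since the text does not say what to do when the quadratic has no real root.

```python
import math

def sample_moments(y):
    """Return (m, c0, c1) as defined in Eqs. (26.5a-c)."""
    n = len(y)
    m = sum(y) / n
    c0 = sum((v - m) ** 2 for v in y) / n
    c1 = sum((y[t] - m) * (y[t - 1] - m) for t in range(1, n)) / (n - 1)
    return m, c0, c1

def ma1_estimates(m, c0, c1):
    """Return (phi0_hat, phi_hat, sigma2_hat) solving Eqs. (26.6a-c)."""
    r1 = c1 / c0
    if r1 == 0.0:
        return m, 0.0, c0                      # assumption: no dependence
    disc = 1.0 - 4.0 * r1 * r1
    if disc < 0.0:                             # assumption: report, do not guess
        raise ValueError("|r1| exceeds 1/2: no real solution for phi")
    phi = (1.0 - math.sqrt(disc)) / (2.0 * r1)  # root with |phi| <= 1
    return m, phi, c1 / phi

def ar1_estimates(m, c0, c1):
    """Return (theta0_hat, theta_hat, sigma2_hat) solving Eqs. (26.7a-c)."""
    r1 = c1 / c0
    return (1.0 - r1) * m, r1, (1.0 - r1 * r1) * c0

# Feeding in exact population moments should return the parameters.
print(ma1_estimates(3.0, 2.5, 1.0))    # moments of an MA(1) with phi0=3, phi=0.5, sigma2=2
print(ar1_estimates(10.0, 1.0, 0.9))   # moments of an AR(1) with theta0=1, theta=0.9, sigma2=0.19
```

In practice one would pass the output of `sample_moments(y)` to either function; the round trip with exact population moments simply confirms that the explicit formulas invert the moment equations.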

To illustrate the convergence argument, we assess the sample mean
when the data are generated by the MA(1) process. We have
$y = (y_1, \ldots, y_n)'$ as the sample observation vector, with
$E(y) = x_1 \mu$ and $V(y) = \Sigma$, where $x_1$ is the n × 1 summer vector,
and

$$\Sigma = \begin{pmatrix}
\gamma_0 & \gamma_1 & 0 & \cdots & 0 & 0 \\
\gamma_1 & \gamma_0 & \gamma_1 & \cdots & 0 & 0 \\
0 & \gamma_1 & \gamma_0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \gamma_0 & \gamma_1 \\
0 & 0 & 0 & \cdots & \gamma_1 & \gamma_0
\end{pmatrix}.$$

The sample mean is $m = \bar y = (1/n)\,x_1'y$, so

$$E(\bar y) = (1/n)\,x_1' E(y) = (1/n)\,x_1' x_1 \mu = \mu,$$

$$V(\bar y) = (1/n^2)\,x_1' V(y)\,x_1 = (1/n^2)\,x_1' \Sigma x_1.$$

Now $x_1' \Sigma x_1$ is the sum of all the elements in the Σ matrix, which by
inspection of the display above is
$x_1' \Sigma x_1 = n\gamma_0 + 2(n - 1)\gamma_1$. So

$$V(\bar y) = (\gamma_0/n)(1 + 2\rho_1 - 2\rho_1/n).$$

This differs from the random sampling result $V(\bar y) = \gamma_0/n$, but still
converges to zero as n goes to infinity. So $\bar y$ converges in mean
square, hence in probability, to μ. For further calculations in this
style, see Goldberger (1964, pp. 142-153).
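
The variance formula for the sample mean can be verified by brute force. In this sketch (function names and numerical values are illustrative assumptions) the first function adds up every element of the Σ matrix implied by a list of autocovariances and divides by n², while the second evaluates the closed form for the MA(1) case.

```python
def var_mean_bruteforce(gamma, n):
    """V(sample mean) = (sum of all n*n elements of Sigma) / n**2.

    gamma[j] is the jth autocovariance; lags beyond the list are zero.
    """
    def g(j):
        return gamma[j] if j < len(gamma) else 0.0
    total = sum(g(abs(s - t)) for s in range(n) for t in range(n))
    return total / n ** 2

def var_mean_ma1(gamma0, rho1, n):
    """Closed form for the MA(1) case: (gamma0/n)(1 + 2 rho1 - 2 rho1/n)."""
    return (gamma0 / n) * (1.0 + 2.0 * rho1 - 2.0 * rho1 / n)

gamma0, gamma1 = 2.5, 1.0           # illustrative MA(1) moments (rho1 = 0.4)
for n in (10, 100, 1000):
    print(n, var_mean_bruteforce([gamma0, gamma1], n),
          var_mean_ma1(gamma0, gamma1 / gamma0, n))
```

Both columns agree, and both shrink toward zero as n grows, which is the mean-square convergence used in the argument.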

Under general conditions, the convergence argument extends to 
other sample moments, to the AR(1) process, and indeed to higher- 
order processes. In that event, the analog estimators of the process 
parameters will be consistent. 

At this point, we can recognize why LS linear regression provides
consistent estimation of the CEF parameters. The normal equations that
determine the LS coefficients in terms of observed sums of squares and
sums of cross-products are essentially the sample counterparts of the
equations that determine the CEF parameters in terms of population
moments. For example, suppose that with n observations in hand, we
run the LS linear regression of $y_t$ on $(1, y_{t-1})$. There are n − 1
usable observations. Let

$$m^* = \sum_{t=2}^{n} y_t/(n - 1), \qquad m^{**} = \sum_{t=2}^{n} y_{t-1}/(n - 1).$$

As long as n is at least moderately large, these will differ only
trivially from each other, and from the m defined in Eq. (26.5a). The LS
slope will be $b = s_{xy}/s_{xx}$, where

$$s_{xx} = \sum_{t=2}^{n} (y_{t-1} - m^{**})^2/(n - 1),$$

$$s_{xy} = \sum_{t=2}^{n} (y_{t-1} - m^{**})(y_t - m^*)/(n - 1).$$

As long as n is at least moderately large, these will differ only
trivially from the $c_0$ and $c_1$ defined in Eqs. (26.5b-c). Being
practically the same as $c_1/c_0$, the LS slope will, under general
conditions, converge in probability to $\gamma_1/\gamma_0 = \beta$. The argument
extends to CEF's with several lagged values of y as conditioning
variables.


26.6. Remarks 


* For stationary stochastic processes, convergence of sample moments
to the corresponding population moments is not inevitable. Consider
the equicorrelated process:

$$y_t = \theta_0 + u_t + v,$$

where the u's are independent drawings from a distribution with
$E(u) = 0$ and $V(u) = \sigma^2$, while the random variable v is independent
of the u's with $E(v) = 0$ and $V(v) = \tau^2$. Then
$\mu = E(y_t) = \theta_0$, $V(y_t) = \sigma^2 + \tau^2$ for all t, and further
$C(y_t, y_{t-j}) = \tau^2$ for all t and for all j > 0. All the off-diagonal
elements of the Σ matrix are equal to $\tau^2$. The population
autocorrelations are $\rho_j = \tau^2/(\sigma^2 + \tau^2)$ for all j > 0. For the
sample mean m we calculate $E(m) = (1/n)\,x_1' E(y) = \mu$, and

$$V(m) = (1/n^2)\,x_1' \Sigma x_1 = (\sigma^2 + \tau^2)/n + (n - 1)\tau^2/n = \sigma^2/n + \tau^2,$$

which does not vanish as n goes to infinity. While unbiased, the sample
mean is not consistent. Variants of this model are used in the analysis
of panel data: see Greene (1990, chap. 16).
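
The failure of convergence is easy to see numerically. The sketch below is illustrative (the function name and the values of σ² and τ² are assumptions): it adds up the n diagonal elements σ² + τ² and the n(n − 1) off-diagonal elements τ² of the Σ matrix and divides by n².

```python
def var_mean_equicorrelated(sigma2, tau2, n):
    """V(m) = (sum of all elements of Sigma) / n**2 for the equicorrelated process."""
    diagonal_sum = n * (sigma2 + tau2)        # n variances
    off_diagonal_sum = n * (n - 1) * tau2     # n(n-1) covariances, all tau2
    return (diagonal_sum + off_diagonal_sum) / n ** 2

sigma2, tau2 = 1.0, 0.5                       # illustrative values
for n in (10, 100, 10000):
    print(n, var_mean_equicorrelated(sigma2, tau2, n))
```

As n grows the variance of the sample mean settles at τ² rather than at zero: the common component v is never averaged away.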

* Nonstationary stochastic processes also arise in economic analysis.
The simplest example is the random walk:

$$y_t = y_{t-1} + u_t,$$

where the u's are independent drawings from a distribution with
$E(u) = 0$ and $V(u) = \sigma^2$. It is clear that
$E(y_t \mid y_{t-1}) = y_{t-1}$, although the variance of the process must
increase with t. In this case, differencing the series will produce
stationarity.


Exercises 


26.1 Show that for any MA(1) process, $|\rho_1| \le 1/2$. Is that true also
for an AR(1) process?


26.2 Consider these two models for an MA(1) process:

$$y_t = \phi_0 + \phi u_{t-1} + u_t,$$

where the u's are independent drawings on a variable u with E(u) = 0
and $V(u) = \sigma^2$, and

$$y_t = \phi_0 + \phi^* u^*_{t-1} + u^*_t,$$

where the u*'s are independent drawings on a variable u* with
E(u*) = 0 and $V(u^*) = \sigma^{*2}$.

(a) Suppose that $\phi^* = 1/\phi$ and $\sigma^{*2} = \phi^2 \sigma^2$. Show that the two
models produce the same population moments μ, $\gamma_0, \gamma_1, \gamma_2, \ldots$.

(b) To rule out the ambiguity found in (a), it is convenient to require
that $|\phi| \le 1$. With that in mind, explain the choice of the
estimator proposed for φ in Section 26.5.


26.3 Suppose that $y_t = \alpha + \beta y_{t-1} + u_t$, where the $u_t$ are
independent N(0, $\sigma^2$) variables. You know that α = 10, β = 3/5, and
$\sigma^2 = 2$. You are told that $y_3 = 50$. Find the best prediction of $y_4$.


26.4 Suppose that $y_t = 1 + 0.8u_{t-1} + 0.6u_{t-2} + u_t$, where the u's
are independent N(0, 1) variables. For this MA(2) process, find

$$E(y_t \mid y_{t-1}), \qquad E(y_t \mid y_{t-1}, y_{t-2}), \qquad E(y_t \mid y_{t-1}, y_{t-2}, y_{t-3}).$$


26.5 Suppose that $y_t = 1 + 0.4y_{t-1} + 0.3y_{t-2} + u_t$, where the u's
are independent N(0, 1) variables. For this AR(2) process, find

$$E(y_t \mid y_{t-1}), \qquad E(y_t \mid y_{t-1}, y_{t-2}), \qquad E(y_t \mid y_{t-1}, y_{t-2}, y_{t-3}).$$


26.6 Suppose that u_t (t = 0, . . . , 100) are independent drawings from
the N(0, 1) distribution. Consider these three models for an observed
time series y_t (t = 1, . . . , 100):


Model 1: y_t = u_t.
Model 2: y_t = ρy_{t−1} + su_t, where y₀ = u₀, ρ = 0.9, s = √(1 − ρ²).
Model 3: y_t = (4/3) sin(7.2t) + (1/3)u_t (angle measured in degrees).


(a) Write a program that generates 101 independent drawings from 
the N(0, 1) distribution, and then produces for each model in 
turn a y, series. 

(b) Complete the program by regressing y_t on (1, y_{t−1}) for each of
your three y_t series. For each regression, report X'X, X'y, b, and
the conventional standard errors for b.

(c) Is any of the three slopes surprising? Comment briefly. If you 
suspect that it is just a coincidence, run your program again to 
see if it recurs.


26.7 Table A.5 contains a data set of annual observations for the
United States, 1956-1980, as taken from Mirer (1988, pp. 24-25). The
variables are:


V1 = Identification number (1, . . . , 25) 
V2 = Year — 1900 
V3 = GNP price index (100 in 1972) 
V4 = Real GNP 
V5 = Real gross private domestic investment 
V6 = Real personal consumption 
V7 = Real disposable personal income 
V8 = Change in GNP price index 
V9 = Change in consumer price index 
V10 = Unemployment rate 
V11 = Money stock (M1) 
V12 = Treasury bill rate 
V13 = Corporate bond rate (Moody’s Aaa). 


Note: In this data set, V4, V5, V6, V7 are in billions of 1972 dollars; 
V11 is in billions of current dollars; V8, V9, V12, V13 are in percent 
per year; V10 is in percent. This data set is presumed to be available as 
an ASCII file labeled TIM. 

Run three linear regressions:


(a) Money autoregression: y_t on 1, y_{t−1}.
(b) Money demand function: y_t on 1, z_t.
(c) Residual autoregression: e_t on 1, e_{t−1}.


Here y = log of real money stock = ln[(V11/V3)100], z = log of real
GNP = ln(V4), and e = residual from (b). There will be 25 observations
for (b), and 24 observations for (a) and (c). For each regression, report
X'X, X'y, b, and the conventional standard errors for b.


26.8 In Exercise 26.7 the slope in the residual autoregression (c)
turned out to be substantially less than that in the money autoregression
(a). Comment on this result in the light of your results in Exercise 26.6.


27 Generalized Classical Regression 


27.1. Generalized Classical Regression Model 


We now generalize the classical regression model to allow the n
observations on y to have different variances and to be correlated. Again
the data consist of the n × 1 random vector y and the n × k nonstochastic
matrix X. The generalized classical regression, or GCR, model consists of
these four assumptions:


(27.1) E(y) = Xβ,
(27.2) V(y) = Σ, with Σ positive definite,
(27.3) X nonstochastic,
(27.4) rank(X) = k.


The only change from the CR model is in the second assumption: the
elements of y now may have different variances and nonzero covariances.
(Positive definiteness simply rules out situations where one of the
y's is an exact linear function of the others.) Special cases of the GCR
model arise when specific assumptions are made on those variances and
covariances.


27.2. Least Squares Estimation 


We begin with LS estimation, which produces coefficients b = Ay and
residuals e = My, with A = Q⁻¹X', Q = X'X, and M = I − XA. By
linear function rules,

E(b) = AE(y) = AXβ = β,

V(b) = AV(y)A' = AΣA' = Q⁻¹RQ⁻¹,

where

R = X'ΣX.


The LS coefficient vector b remains unbiased, but its variance matrix is
no longer a scalar multiple of Q⁻¹.
Using linear function rules again, we have


E(e) = MXβ = 0,    V(e) = MΣM'.

So the expected sum of squared residuals is

E(e'e) = tr[V(e)] = tr(MΣM') = tr(M'MΣ) = tr(MΣ),


whence the adjusted mean squared residual σ̂² = e'e/(n − k) has
expectation


E(σ̂²) = tr(MΣ)/(n − k).


This expectation involves a mixture of the elements of Σ, rather than
the single parameter σ² as in the CR model. (Indeed, in the GCR model
there may not be a natural parameter called σ².) Observe that


MΣ = (I − XA)Σ = Σ − XQ⁻¹X'Σ,

tr(XQ⁻¹X'Σ) = tr(Q⁻¹X'ΣX) = tr(Q⁻¹R).

So

tr(MΣ) = tr(Σ) − tr(Q⁻¹R),


an expression which is convenient for computational purposes.
Proceeding to the usual estimator of V(b), namely V̂(b) = σ̂²Q⁻¹, we
have


E[V̂(b)] = [tr(MΣ)/(n − k)]Q⁻¹,


which clearly differs from V(b) = Q⁻¹RQ⁻¹. The familiar estimator of
the variance matrix of b is biased, so the conventional standard errors 
are not correct measures of imprecision, and consequently the confi- 
dence region and hypothesis test procedures of Chapters 19-22 will not 
be valid. For some examples of the bias see Exercises 28.1 and 28.4.

There are several ways to proceed at this point. One might retain the
LS coefficient vector b as the estimator of β, and seek a correct estimator
for its variance matrix, in order to permit valid inferences. Or (and
this is the line taken here) one might seek a better estimator of β. The
possibility of a better estimator is open, because the Gauss-Markov
Theorem (Section 15.4), which established the MVLUE property of LS,
relied on the assumption that V(y) = σ²I.
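
The bias of the conventional variance estimator derived in this section can be checked numerically. The sketch below, in Python with NumPy, evaluates the true V(b) = Q⁻¹RQ⁻¹ and the expectation [tr(MΣ)/(n − k)]Q⁻¹ for a given X and Σ. The function name and the illustrative design (an intercept plus a trend, with variances proportional to the square of the trend) are assumptions of this example, not taken from the text.

```python
import numpy as np

def ls_variance_comparison(X, Sigma):
    """Return (true V(b), E[conventional estimator]) for LS when V(y) = Sigma."""
    n, k = X.shape
    Q_inv = np.linalg.inv(X.T @ X)
    R = X.T @ Sigma @ X
    true_var = Q_inv @ R @ Q_inv                          # V(b) = Q^{-1} R Q^{-1}
    tr_M_Sigma = np.trace(Sigma) - np.trace(Q_inv @ R)    # tr(M Sigma)
    expected_conventional = tr_M_Sigma / (n - k) * Q_inv  # E[sigma-hat^2 Q^{-1}]
    return true_var, expected_conventional

# Illustrative (made-up) design: intercept and trend, variance proportional to x^2
x = np.arange(1.0, 21.0)
X = np.column_stack([np.ones_like(x), x])
Sigma = np.diag(x**2)

true_var, expected_conventional = ls_variance_comparison(X, Sigma)
```

In this particular design the (2, 2) element of `expected_conventional` is smaller than that of `true_var`, so the conventional standard error for the slope would tend to understate its imprecision. When Σ = σ²I the two matrices coincide.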


27.3. Generalized Least Squares Estimation 
For estimation of β, the key result is


AITKEN'S THEOREM. In the GCR model, with Σ known, the
MVLUE of β is the generalized least squares (or GLS) estimator, b* = A*y,
where A* = (X'Σ⁻¹X)⁻¹X'Σ⁻¹.


Proof. Observe that the k × n matrix A* is nonstochastic, has rank k,
and satisfies


A*X = I,    A*ΣA*' = (X'Σ⁻¹X)⁻¹.

It follows by linear function rules that

E(b*) = β,    V(b*) = (X'Σ⁻¹X)⁻¹,

so b* is a linear unbiased estimator of β. To show that b* has minimum
variance in the class of linear unbiased estimators in the GCR model,
we first show that the GCR model is equivalent to a CR model in
transformed data, and then that b* is the LS estimator in that CR model.

Recall a construction used for another purpose in Section 18.4.

Because Σ is positive definite, we can write Σ = CΛC', where C is
orthonormal and Λ is diagonal with all diagonal elements positive. Let
Λ* be the diagonal matrix with the reciprocal square roots of the
diagonal elements of Λ on its diagonal, and let H = CΛ*C', so H'H = Σ⁻¹
and HΣH' = I. Let y* = Hy and X* = HX; these should be viewed as
observations on transformed variables. The n × n matrix H is
nonstochastic and nonsingular, so from Eqs. (27.1)-(27.4) we deduce that:


(27.1*) E(y*) = HE(y) = HXβ = X*β,
(27.2*) V(y*) = HV(y)H' = HΣH' = I,
(27.3*) X* = HX is nonstochastic,
(27.4*) rank(X*) = rank(X) = k.


Taken together, these say that a CR model (with σ² = 1) applies to the
transformed data (y*, X*). Because H is nonsingular, the argument
reverses, so the GCR model for the original data (y, X) is equivalent to
a CR model for the transformed data (y*, X*).

The parameter vector β is unaffected by the transformation. By the
Gauss-Markov Theorem itself, among all linear functions of y* that are
unbiased for β, the one with minimum variance is the coefficient vector
in LS linear regression of y* on X*, namely


c* = (X*'X*)⁻¹X*'y*.


But X*'X* = (X'H')(HX) = X'Σ⁻¹X, and X*'y* = (X'H')(Hy) =
X'Σ⁻¹y, so c* = b*. Finally, because H is nonsingular, the class of linear
functions of y* is the same as the class of linear functions of y. We
conclude that in the GCR model, b* is the MVLUE of β. ∎
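
The equivalence used in the proof can be illustrated numerically. The following Python/NumPy sketch builds H = CΛ*C' from the eigendecomposition of Σ, runs LS on the transformed data, and compares the result with the direct formula b* = (X'Σ⁻¹X)⁻¹X'Σ⁻¹y. The function names, the seed, and the randomly generated Σ and data are assumptions made only for this illustration.

```python
import numpy as np

def gls_direct(y, X, Sigma):
    """b* = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y (Sigma symmetric)."""
    Si_X = np.linalg.solve(Sigma, X)              # Sigma^{-1} X
    return np.linalg.solve(X.T @ Si_X, Si_X.T @ y)

def symmetric_H(Sigma):
    """H = C Lambda* C', so that H'H = Sigma^{-1} and H Sigma H' = I."""
    lam, C = np.linalg.eigh(Sigma)                # Sigma = C diag(lam) C'
    return C @ np.diag(lam ** -0.5) @ C.T

def gls_by_transformation(y, X, Sigma):
    """LS regression of y* = Hy on X* = HX."""
    H = symmetric_H(Sigma)
    coef, *_ = np.linalg.lstsq(H @ X, H @ y, rcond=None)
    return coef

rng = np.random.default_rng(0)                    # arbitrary seed
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
B = rng.normal(size=(n, n))
Sigma = B @ B.T + n * np.eye(n)                   # an arbitrary positive definite matrix
y = X @ np.array([1.0, 2.0, -1.0]) + rng.multivariate_normal(np.zeros(n), Sigma)

b_direct = gls_direct(y, X, Sigma)
b_transformed = gls_by_transformation(y, X, Sigma)
```

The two coefficient vectors should agree up to rounding error.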


27.4. Remarks on GLS Estimation 


The following remarks may aid the interpretation and implementation 
of GLS estimation. 

* Why is b* referred to as the "generalized least squares" estimator?
Being the LS coefficient vector for the transformed data, b* is the value
for c that minimizes the criterion φ*(c) = u*'u*, where


u* = y* − X*c = H(y − Xc) = Hu,


with u = y − Xc. But H'H = Σ⁻¹, so φ*(c) = u'Σ⁻¹u. This criterion is
a positive definite quadratic form in u, which is indeed a generalization
of the sum of squares, u'u.

* The device of transforming the data, used to establish the optimality
of b*, also provides a computational routine. With Σ known, to do GLS
estimation, transform the data and run LS.

* The transformation is not unique. For example, instead of using
H = CΛ*C', we can use H* = Λ*C', because


H*'H* = CΛ*Λ*C' = CΛ⁻¹C' = (CΛC')⁻¹ = Σ⁻¹,
H*ΣH*' = Λ*C'(CΛC')CΛ* = Λ*ΛΛ* = I.

So a CR model (with σ² = 1) will apply to the data y** = H*y, X** =
H*X. Despite the nonuniqueness of the transformation, the estimator
b* will be unique, as is readily confirmed. For any GCR model, there
are in fact many nonsingular matrices that will transform it into a CR
model. All produce the same b*, so it is strictly a matter of convenience
as to which version of H should be used in practice.
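
A short numerical check of this nonuniqueness may be helpful. The Python/NumPy sketch below forms three transformation matrices for the same Σ: the symmetric H = CΛ*C', the alternative H* = Λ*C', and (as an additional choice not discussed in the text) the inverse of the Cholesky factor of Σ. The function names, seed, and simulated data are hypothetical.

```python
import numpy as np

def ls(y, X):
    """LS coefficient vector."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def transforms(Sigma):
    """Three matrices H satisfying H'H = Sigma^{-1} and H Sigma H' = I."""
    lam, C = np.linalg.eigh(Sigma)            # Sigma = C diag(lam) C'
    lam_star = np.diag(lam ** -0.5)
    L = np.linalg.cholesky(Sigma)             # Sigma = L L'
    return {
        "symmetric": C @ lam_star @ C.T,      # H  = C Lambda* C'
        "lambda_star_C": lam_star @ C.T,      # H* = Lambda* C'
        "cholesky": np.linalg.inv(L),         # extra choice, not from the text
    }

rng = np.random.default_rng(1)                # arbitrary seed
n = 12
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
B = rng.normal(size=(n, n))
Sigma = B @ B.T + np.eye(n)                   # arbitrary positive definite matrix
y = rng.normal(size=n)

estimates = {name: ls(H @ y, H @ X) for name, H in transforms(Sigma).items()}
```

All three entries of `estimates` should coincide up to rounding error, in line with the uniqueness of b*.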

* To obtain the GLS estimator, it suffices to know Σ up to a scalar
multiple. Suppose that Σ = σ²Ω, where Ω is known but the scalar σ²
is unknown. Since Ω is positive definite, we can find an H° such that
H°'H° = Ω⁻¹ and H°ΩH°' = I. Let y° = H°y and X° = H°X. Then


E(y°) = H°E(y) = H°Xβ = X°β,
V(y°) = H°V(y)H°' = H°ΣH°' = H°(σ²Ω)H°' = σ²I,


and X° is nonstochastic with full column rank. So a CR model (with σ²
unknown) applies to (y°, X°). The LS estimator on the transformed data
is


b° = (X°'X°)⁻¹X°'y°.


But X°'X° = X'H°'H°X = X'Ω⁻¹X = X'(Σ/σ²)⁻¹X = σ²X'Σ⁻¹X, and
similarly X°'y° = σ²X'Σ⁻¹y, so b° = b*. When such an H° is used, it will
remain to estimate σ² and V(b*), but for that task all of CR theory
applies. So knowledge of Σ up to proportionality suffices to calculate
the GLS estimator b* and to estimate its variance matrix unbiasedly.

* With respect to goodness of fit: let


φ(c) = u'u,    φ*(c) = u*'u* = u'Σ⁻¹u,


where u = y − Xc and u* = y* − X*c. Clearly φ(b) ≤ φ(b*) (with
equality iff b = b*) so LS gives a better fit than GLS to the original
data. By the same token, φ*(b*) ≤ φ*(b) (with equality iff b = b*), so
GLS gives a better fit to the transformed data.
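
The two inequalities can be verified on simulated data. In the Python/NumPy sketch below, the design, the seed, and the assumption that the standard deviation is proportional to x are all made up for illustration; `phi` and `phi_star` are hypothetical helper names for the two criteria.

```python
import numpy as np

rng = np.random.default_rng(2)               # arbitrary seed
n = 40
x = np.linspace(1.0, 5.0, n)
X = np.column_stack([np.ones(n), x])
sigma = x                                     # assumed: sd proportional to x
y = 1.0 + 0.5 * x + sigma * rng.normal(size=n)
Sigma_inv = np.diag(1.0 / sigma**2)

b = np.linalg.solve(X.T @ X, X.T @ y)                               # LS
b_star = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)  # GLS

def phi(c):
    """Sum of squared residuals in the original data, u'u."""
    u = y - X @ c
    return u @ u

def phi_star(c):
    """Generalized sum of squares, u' Sigma^{-1} u."""
    u = y - X @ c
    return u @ Sigma_inv @ u
```

Here `phi(b) <= phi(b_star)` and `phi_star(b_star) <= phi_star(b)` both hold, since each estimator minimizes its own criterion.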

* It makes no sense to compare φ(b) with φ*(b*), as is occasionally
done; the two criteria may not even be in the same units. A fortiori, it
makes no sense to compare an R*² calculated from the LS regression
of y* on X*, with the R² from the LS regression of y on X; the
dependent variables y and y* are different, and furthermore the X* matrix
may not contain a summer vector even when X does.


27.5. Feasible Generalized Least Squares Estimation


As we have seen, calculation of the GLS estimator b* requires knowledge
of V(y) = Σ at least up to proportionality:


b* = A*y,

where

A* = (X'Ω⁻¹X)⁻¹X'Ω⁻¹ = (X'Σ⁻¹X)⁻¹X'Σ⁻¹,


and Ω, a scalar multiple of Σ = V(y), is known. In practice Ω will be
unknown, so that GLS estimation will not be feasible. A feasible
generalized least squares, or FGLS, estimator of β is defined by


b̂* = Â*y,    where Â* = (X'Σ̂⁻¹X)⁻¹X'Σ̂⁻¹,


with Σ̂ being an estimator of Σ.

The properties of an FGLS coefficient estimator b̂* depend on the
properties of the variance-matrix estimator Σ̂. The key result is that if
Σ̂ is a consistent estimator of Σ, then under general conditions the
FGLS estimator b̂* has the same asymptotic distribution as the GLS
estimator b*. For discussion of the conditions, see Greene (1990, pp.
388-390) or Judge et al. (1988, pp. 352-356). That is to say, for large
n, the distinction between the distributions of b̂* and b* is negligible.

For some insight into this conclusion, recall two previous situations.

(1) In random sampling from a bivariate population, the sample LP
slope (which uses deviations from the sample means) and the ideal
sample LP slope (which uses deviations from the population means)
have the same asymptotic distribution: see Section 10.5.

(2) In the CNR model, the statistics

u_j = (b_j − β_j)/σ̂_{b_j}    and    z_j = (b_j − β_j)/σ_{b_j}

have the same asymptotic distribution, namely N(0, 1): see Section 22.6.
These examples are suggestive, because they show that replacing an
unknown parameter (μ_x in the first case; σ² in the second) by a consistent
estimator may make the statistic feasible to calculate without affecting
the asymptotic distribution.

To obtain a consistent estimator of Σ for use in FGLS estimation, it
is natural to rely on the residual vector e from LS regression of y on
X. After all, Σ = V(y) = V(ε) where ε = y − Xβ, and e = y − Xb. How
the residual vector should be used, and whether the quality of the
resulting estimator of Σ is adequate to ensure that FGLS has the same
asymptotic distribution as GLS, depends on the context of special cases.
Those special cases are defined by the "structuring" of Σ, where
"structuring" means specifying that Σ = Σ(θ), where θ is an unknown
parameter vector with a relatively small number of elements. We will
explore two leading special cases in Chapter 28. For the present, observe
that unless such knowledge is available, no consistent estimator of Σ
will be obtainable. After all, there are n(n + 1)/2 distinct elements in
Σ, and one can hardly be optimistic about estimating so many distinct
parameters (along with the k elements of β) when only n observations
are in hand.


27.6. Extensions of the GCR Model 


The GCR model may be extended into a generalized neoclassical regression
model, by allowing X to be stochastic:


E(y|X) = Xβ,
V(y|X) = Σ, with Σ positive definite,
X random,
rank(X) = k.


Again no new theory is required. Just transform into a NeoCR model
via an appropriate H matrix, and apply the theory of Chapter 25.

The GCR may be strengthened into a generalized classical normal
regression model, by adding a normality assumption:


y ~ N(Xβ, Σ),
Σ positive definite,
X nonstochastic,
rank(X) = k.


No new theory is required here. If Σ is known up to proportionality,
then transform the data into a CNR model via an appropriate H matrix
(which will also preserve normality) and apply the statistical inference
theory of Chapters 19-22. Among other things, b* will be the ML
estimator. If Σ is unknown, but structured as Σ(θ) in the sense defined
above, then the likelihood function may be maximized with respect to
β and θ jointly. This will produce a different estimator of β than that
obtained by FGLS (which first estimates θ, and then uses the implied
estimate of Σ, namely Σ̂ = Σ(θ̂), to estimate β). Under general conditions,
this ML estimator will have the same asymptotic distribution as the FGLS
and GLS estimators. For discussion, see Amemiya (1985, pp. 190-191,
200-203).


Exercises 


27.1 Suppose that the GCR model applies to E(y) = Xβ, V(y) = Σ.
Let b, e, and ŷ denote the coefficient, residual, and fitted-value vectors
for LS regression of y on X, and let b* denote the GLS coefficient
vector. For each of the following statements, determine whether it is
true or false:


(a) The covariance matrix of b and b* is equal to the variance matrix
of b*.

(b) If t is a linear unbiased estimator of β, then the covariance matrix
of t and b* is equal to the variance matrix of b*.

(c) The covariance of each element of ŷ with the corresponding
element of e may be nonzero, but the sum of those covariances
is zero.


27.2 Determine whether the following statement is true or false:
Suppose that the CR model applies to E(y) = Xβ, that T is a nonstochastic
nonsingular matrix, and that y* = Ty, X* = TX; then GLS regression
of y* on X* gives the same coefficient estimates as LS regression of y
on X.


27.3 With data drawn from a GCR model, a researcher first ran LS
regression using her own LS program to obtain coefficients and standard
errors. Then she was given the true Σ (up to a scalar multiple).
She transformed the data appropriately, and ran LS on the transformed
data, using the same program to obtain coefficients and standard errors.
For several coefficients, standard errors in the second run were larger
than those in the first run. Does this contradict Aitken's Theorem?
Explain.


28 Heteroskedasticity and Autocorrelation 


28.1. Introduction 


We sketch two leading special cases of the GCR model: pure hetero- 
skedasticity, and a nonstationary first-order autoregressive process. For 
more complete treatments see Greene (1990, chaps. 13, 14, 15), Judge 
et al. (1988, chaps. 8, 9), or Amemiya (1985, chap. 6). 


28.2. Pure Heteroskedasticity 


In the pure heteroskedasticity case, the y_i's are uncorrelated, but have
different variances: the matrix Σ is diagonal, with diagonal elements
σ₁², . . . , σ_i², . . . , σ_n². This case will arise when, in the underlying
multivariate population, the conditional variance function V(y|x) is not
constant across x. An n × n matrix H that makes HΣH' = I is the
diagonal matrix that has the 1/σ_i on its diagonal. If the σ_i's are known,
then we can transform the data by dividing all variables at the ith
observation by σ_i, to get


y_i* = y_i/σ_i,    x_ij* = x_ij/σ_i.


The CR model will apply to the new data and LS regression of y* on
X* will produce the GLS estimator b*.

Scalars proportional to the σ_i may be used instead. Suppose we know
that V(y_i|x_i) = σ²ω_i², where the ω_i² are known but σ² is unknown. Divide
all variables at the ith observation by ω_i, and run LS on the transformed
data, to obtain b*. An often-cited example of this arises when the
variances are proportional to the square of one of the explanatory
variables, say x_j (where x_j is always positive). Then division by x_j, which
makes the transformed variables ratios of the original variables, is the
appropriate transformation.

If the σ_i² are not known up to proportionality, but some structuring
of them is known, then FGLS may be available.

Suppose we know that the σ_i² have only two distinct values: ω₁² for
observations 1, . . . , n₁, and ω₂² for observations n₁ + 1, . . . , n, but we
do not know the values of the two ω²'s. Then this version of FGLS is
natural: Regress y on X to get the residual vector e. Partition the residual
vector e as (e₁', e₂')'. Let ω̂₁² = e₁'e₁/n₁, and ω̂₂² = e₂'e₂/n₂. Transform the
data by dividing the first n₁ observations through by ω̂₁, and dividing
the remaining n₂ = n − n₁ observations through by ω̂₂. Run LS on the
transformed observations.
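
The two-variance FGLS procedure just described can be sketched in a few lines of Python with NumPy. The function names, the seed, and the simulated data (true standard deviations 1 and 4, slope 0.5) are assumptions chosen only to illustrate the steps.

```python
import numpy as np

def ls(y, X):
    """LS coefficient vector."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def two_group_fgls(y, X, n1):
    """FGLS when observations 1..n1 share one variance and the rest share another."""
    n = len(y)
    e = y - X @ ls(y, X)                        # first-step LS residuals
    w1 = np.sqrt(e[:n1] @ e[:n1] / n1)          # omega-hat_1
    w2 = np.sqrt(e[n1:] @ e[n1:] / (n - n1))    # omega-hat_2
    w = np.concatenate([np.full(n1, w1), np.full(n - n1, w2)])
    b_fgls = ls(y / w, X / w[:, None])          # LS on the transformed data
    return b_fgls, (w1**2, w2**2)

rng = np.random.default_rng(3)                  # arbitrary seed
n1, n2 = 200, 200
x = rng.uniform(0.0, 10.0, size=n1 + n2)
X = np.column_stack([np.ones(n1 + n2), x])
sd = np.concatenate([np.full(n1, 1.0), np.full(n2, 4.0)])   # made-up true values
y = 2.0 + 0.5 * x + sd * rng.normal(size=n1 + n2)

b_fgls, (w1_sq, w2_sq) = two_group_fgls(y, X, n1)
```

In a sample of this size the two estimated variances should land near 1 and 16, and the FGLS slope near 0.5.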

Or, suppose we know that V(y|x) = g(x; θ) where the function g(x; ·)
is known except for the r × 1 parameter vector θ, with r much less than
n. One possibility here is that V(y|x) = exp(α'x). Because log V(y|x) =
log E(ε²|x) = α'x, the following application of FGLS seems natural:
Run the LS regression of y on the x's, obtaining residuals e_i. Then run
the LS regression of the log e_i² on the x's to estimate α, and calculate
σ̂_i² = exp(α̂'x_i). Transform the observations by dividing through by σ̂_i,
and run LS on the transformed data. (Note the informal flavor of this
approach; after all, V(log y) ≠ log[V(y)].)

In each of these cases, the prior structuring of the σ_i² serves to reduce
the number of unknown parameters. Only then can estimators of Σ be
obtained that have sufficient reliability to ensure that b̂* shares the
asymptotic distribution of b*.

However, under pure heteroskedasticity, if the objective is merely to
obtain valid estimates of the variances of the LS coefficients, then such
structuring is not needed. Let σ̂_i² = e_i², where the e_i are the residuals
from LS regression. Let Σ̂ be the diagonal matrix with the σ̂_i² on the
diagonal, and let R̂ = X'Σ̂X. Then V̂(b) = Q⁻¹R̂Q⁻¹ will under general
conditions provide a valid estimator of V(b). A similar procedure was
introduced in discussing BLP estimation in Section 25.5.
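
A minimal Python/NumPy sketch of this variance estimator follows. It computes both the conventional estimate σ̂²Q⁻¹ and Q⁻¹R̂Q⁻¹ with R̂ = X'Σ̂X. The function name, the seed, and the simulated heteroskedastic design are hypothetical choices for illustration.

```python
import numpy as np

def ls_with_robust_variance(y, X):
    """Return b, the conventional variance estimate, and Q^{-1} R-hat Q^{-1}."""
    n, k = X.shape
    Q_inv = np.linalg.inv(X.T @ X)
    b = Q_inv @ X.T @ y
    e = y - X @ b                                   # LS residuals
    conventional = (e @ e / (n - k)) * Q_inv        # sigma-hat^2 Q^{-1}
    R_hat = (X * (e**2)[:, None]).T @ X             # X' diag(e_i^2) X
    robust = Q_inv @ R_hat @ Q_inv
    return b, conventional, robust

rng = np.random.default_rng(7)                      # arbitrary seed
n = 1000
x = rng.uniform(-3.0, 3.0, size=n)
X = np.column_stack([np.ones(n), x])
# Made-up design: the disturbance sd grows with |x|
y = 1.0 + 0.5 * x + (0.5 + np.abs(x)) * rng.normal(size=n)

b, conventional, robust = ls_with_robust_variance(y, X)
```

Because the disturbance variance in this design is largest where x is far from its mean, the robust estimate of the slope variance exceeds the conventional one.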


28.3. First-Order Autoregressive Process 


Suppose that Σ = σ²Ω, where

      | 1         ρ         ρ²        . . .   ρ^{n−1} |
      | ρ         1         ρ         . . .   ρ^{n−2} |
Ω  =  | ρ²        ρ         1         . . .   ρ^{n−3} |
      | . . .                                         |
      | ρ^{n−1}   ρ^{n−2}   ρ^{n−3}   . . .   1       |

with −1 < ρ < 1. This says that C(y_i, y_j) = σ²ρ^{|i−j|}. With V(y_i) =
σ²ρ⁰ = σ² for all i, the y_i's are homoskedastic. But with ρ ≠ 0, they
are correlated. We refer to this as the first-order autoregressive, or AR(1),
case of the GCR model.

The successive drawings on the y_i are not independent, and indeed
this AR(1) specification is intended to apply to time series data. In terms
of the development in Chapter 26, we have combined a regression
model for E(y) with a time series model for V(y). Referring to Section
26.4, we see that a mechanism that will produce the present specification
is given by:


(28.1) y_i = μ_i + ε_i,
(28.2) ε_i = ρε_{i−1} + u_i,


where the μ_i = x_i'β are nonstochastic, and the u_i are independent and
identically distributed with E(u) = 0, V(u) = σ_u². It follows that


E(y_i) = μ_i,
V(y_i) = V(ε_i) = σ_u²/(1 − ρ²) = σ²,
C(y_i, y_{i−1}) = ρσ²,


and so forth. In terms of the discussion in Section 26.2, here a stationary
population model applies to the disturbances ε_i: the y_i's themselves have
different expectations and hence the process generating them is not
stationary. Still, the parameter ρ is the population first autocorrelation
coefficient of the disturbances ε_i, and also of the dependent variable y_i.

If ρ is known, then Σ is known up to proportionality, so GLS may be
implemented. Define the n × n matrix

      | √(1 − ρ²)   0    0   . . .    0   0 |
      | −ρ          1    0   . . .    0   0 |
H  =  | 0          −ρ    1   . . .    0   0 |
      | . . .                               |
      | 0           0    0   . . .   −ρ   1 |

and observe that HΣH' = σ²(1 − ρ²)I, which is proportional to I. So
transformation of the data by this H will produce variables on which
LS can be run to obtain the GLS estimator b*.

To clarify the transformation, let y* = Hy, X* = HX. Then


y₁* = √(1 − ρ²) y₁,    x₁* = √(1 − ρ²) x₁,

y_i* = y_i − ρy_{i−1},    x_i* = x_i − ρx_{i−1}    (i = 2, . . . , n),

with

E(y_i*) = x_i*'β,    V(y_i*) = (1 − ρ²)V(y_i) = σ_u².


Apart from the special treatment of the first observation, this
transformation can be rationalized directly by reference to the underlying
process in Eqs. (28.1)-(28.2): Lag Eq. (28.1) to get y_{i−1} = μ_{i−1} + ε_{i−1},
multiply that by ρ, and subtract from Eq. (28.1), to get


(28.3) y_i − ρy_{i−1} = μ_i − ρμ_{i−1} + ε_i − ρε_{i−1}
                      = x_i'β − ρx_{i−1}'β + u_i
                      = (x_i − ρx_{i−1})'β + u_i.


This says that y_i* = x_i*'β + u_i where the u_i's have zero expectation,
constant variance, and are uncorrelated.
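
The claim that this H reduces the AR(1) variance matrix to a scalar multiple of I is easy to check numerically. In the Python/NumPy sketch below, the function names and the values n = 6, ρ = 0.8 are arbitrary choices for illustration; the matrix Ω has (i, j) element ρ^{|i−j|}.

```python
import numpy as np

def ar1_omega(n, rho):
    """Omega with (i, j) element rho^|i-j|."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def ar1_transform(n, rho):
    """H: sqrt(1 - rho^2) in the (1,1) position, 1 on the rest of the diagonal, -rho below it."""
    H = np.eye(n) - rho * np.eye(n, k=-1)
    H[0, 0] = np.sqrt(1.0 - rho**2)
    return H

n, rho = 6, 0.8                            # arbitrary illustrative values
H = ar1_transform(n, rho)
check = H @ ar1_omega(n, rho) @ H.T        # should equal (1 - rho^2) I
```

Applied to a data vector, `H @ y` rescales the first observation by √(1 − ρ²) and replaces each later observation by y_i − ρy_{i−1}.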

If, as in practice, ρ is unknown, then GLS cannot be implemented.
But the following application of FGLS is natural and is commonly used.
First run the LS regression of y on X to get the residuals e. Then
regress e_i on e_{i−1} (across i = 2, . . . , n) to estimate ρ as


ρ̂ = (sum of e_i e_{i−1}) / (sum of e_{i−1}²),    both sums over i = 2, . . . , n.


No intercept is required when the sum of the residuals is zero. Transform
the data as above, using ρ̂ in place of ρ, then run LS on the
transformed data. Under general conditions, this FGLS procedure will
give estimators with the same asymptotic distribution as GLS. The
rationale for the estimate of ρ is straightforward. In this model,


ρ = C(y_i, y_{i−1})/V(y_i) = C(ε_i, ε_{i−1})/V(ε_i),


where ε_i = y_i − μ_i, with μ_i = x_i'β. But e_i = y_i − μ̂_i, with μ̂_i = x_i'b, so the
residuals e_i are "predictors" of the corresponding disturbances ε_i. As a
consequence, the sample moments of the residuals will consistently
estimate the population moments of the disturbances.
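
The FGLS steps for the AR(1) case can be sketched as follows in Python with NumPy. The function names, the seed, and the simulated data (ρ = 0.7, slope 2) are assumptions for illustration; the sketch also assumes that the estimated ρ falls strictly between −1 and 1, which the no-intercept regression does not guarantee in every sample.

```python
import numpy as np

def ls(y, X):
    """LS coefficient vector."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def ar1_fgls(y, X):
    """Two-step FGLS with AR(1) disturbances; returns (b_fgls, rho_hat)."""
    e = y - X @ ls(y, X)                               # step 1: LS residuals
    rho_hat = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])     # step 2: e_i on e_{i-1}, no intercept
    scale = np.sqrt(1.0 - rho_hat**2)                  # assumes |rho_hat| < 1
    y_star = np.concatenate([[scale * y[0]], y[1:] - rho_hat * y[:-1]])
    X_star = np.vstack([scale * X[:1], X[1:] - rho_hat * X[:-1]])
    return ls(y_star, X_star), rho_hat                 # step 3: LS on transformed data

rng = np.random.default_rng(4)                         # arbitrary seed
n, rho = 500, 0.7                                      # made-up true values
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
u = rng.normal(size=n)
eps = np.empty(n)
eps[0] = u[0] / np.sqrt(1.0 - rho**2)                  # draw from the stationary distribution
for i in range(1, n):
    eps[i] = rho * eps[i - 1] + u[i]
y = 1.0 + 2.0 * x + eps

b_fgls, rho_hat = ar1_fgls(y, X)
```

In a sample of this size `rho_hat` should be close to 0.7 and the FGLS slope close to 2.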

An alternative approach may be more attractive when ρ is unknown.
Rewrite Eq. (28.3) as


(28.4) y_i = x_i'γ₁ + x_{i−1}'γ₂ + y_{i−1}γ₃ + u_i = z_i'γ + u_i,

where

γ₁ = β,    γ₂ = −βρ,    γ₃ = ρ,

z_i = (x_i', x_{i−1}', y_{i−1})',    γ = (γ₁', γ₂', γ₃)'.


Because E(y_i|z_i) = z_i'γ, with the u_i's homoskedastic and uncorrelated,
we may fit Eq. (28.4) by LS, using observations i = 2, . . . , n. Neither
the CR nor the NeoCR model is applicable, because the lagged value
of the dependent variable appears among the explanatory variables.
Indeed the LS estimates of Eq. (28.4) are not unbiased, but under
general conditions, they are consistent. The argument is similar to that
in Section 26.5.

In fitting Eq. (28.4), one may well want to impose the restriction that
γ₂ = −γ₁γ₃. If so, the required LS algorithm will be a nonlinear one.
More on this in Chapter 29. Under general conditions, the resulting
estimator will have the same asymptotic distribution as the GLS
estimator. Under normality, another alternative is available: the likelihood
function may be maximized jointly with respect to β, σ², and ρ.


28.4. Remarks 


* In empirical research, when the AR(1) specification is entertained
but ρ is unknown, it is customary first to test the null hypothesis ρ = 0.
If the null is accepted then LS estimates are used; if it is rejected then
FGLS estimates are used. This procedure is a variant of the pretest
estimation strategy introduced in Section 24.4.

* The traditional statistic for testing ρ = 0 is the Durbin-Watson statistic
d, which is virtually equal to 2(1 − ρ̂). Tables of critical values are
provided in most econometrics textbooks, along with rather complicated
instructions. A considerably simpler test procedure, which has an
asymptotic justification, treats √n ρ̂ as a N(0, 1) variable on the null
hypothesis ρ = 0: see Judge et al. (1988, pp. 594-401).
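
Both statistics are simple to compute from LS residuals. The Python/NumPy sketch below uses the standard definition of the Durbin-Watson statistic, d = Σ(e_i − e_{i−1})²/Σe_i², which is not spelled out in the text above; the function names, the seed, and the simulated data (generated with ρ = 0, so the null is true) are assumptions for illustration.

```python
import numpy as np

def durbin_watson(e):
    """d = sum over i >= 2 of (e_i - e_{i-1})^2, divided by sum of e_i^2."""
    return np.sum(np.diff(e) ** 2) / np.sum(e**2)

def first_autocorrelation(e):
    """rho-hat from the no-intercept regression of e_i on e_{i-1}."""
    return (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])

rng = np.random.default_rng(5)               # arbitrary seed
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(size=n)       # independent disturbances: rho = 0

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
d = durbin_watson(e)
rho_hat = first_autocorrelation(e)
z = np.sqrt(n) * rho_hat                     # compare with N(0, 1) critical values
```

In this sample `d` and `2 * (1 - rho_hat)` should differ only by a small end-effect term, and `z` would be compared with, say, ±1.96 for a two-sided test at the 5 percent level.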

* A significant value of d or of ρ̂ should not be read automatically as
evidence in favor of the AR(1) specification. After all, many other
stationary population models also generate first-order autocorrelation
in the residuals. Examination of higher-order residual autocorrelations
may suggest a more appropriate specification for Σ. Furthermore, as
seen in Section 26.1, changing expectations can produce a series that
appears to be autocorrelated. So omitting an explanatory variable that
is itself autocorrelated may well produce autocorrelation in the
residuals.

* The situation changes drastically if the lagged value of the dependent
variable appears as one of the original explanatory variables, while
an AR(1) process is entertained for the disturbances. Then the FGLS
approach described above is inappropriate, as are the above tests for
ρ = 0. The easiest way to see this is to recognize that the original function
no longer has a CEF interpretation. For a simple example, suppose that


(28.5) y_i = α + βy_{i−1} + ε_i,
(28.6) ε_i = ρε_{i−1} + u_i,


where the u_i's are independent and identically distributed with
expectation zero and variance σ², while |β| < 1 and |ρ| < 1. Then
E(y_i|y_{i−1}) ≠ α + βy_{i−1} because C(y_{i−1}, ε_i) ≠ 0. To proceed, lag Eq. (28.5)
to get y_{i−1} = α + βy_{i−2} + ε_{i−1}, multiply that by ρ, and subtract from Eq.
(28.5), to get


y_i − ρy_{i−1} = α(1 − ρ) + βy_{i−1} − ρβy_{i−2} + ε_i − ρε_{i−1},

whence

(28.7) y_i = α(1 − ρ) + (β + ρ)y_{i−1} − ρβy_{i−2} + u_i.

Because u_i is independent of past y's, this says that


E(y_i|y_{i−1}, y_{i−2}) = α(1 − ρ) + (β + ρ)y_{i−1} − ρβy_{i−2},

which is a special case of a stationary AR(2) specification for the observed
variable y. Regressing y_i on (1, y_{i−1}, y_{i−2}) will give consistent estimates of
α(1 − ρ), (β + ρ), and −ρβ. But a first-step regression of y_i on (1, y_{i−1})
is a short regression that will not consistently estimate β (nor β + ρ, for
that matter): see Exercise 28.7. With the first step invalid, the remainder
of the FGLS procedure will also be invalid. The residuals from the first
step will no longer be valid predictors of the ε's, and hence the rationale
for using them to estimate ρ will vanish.
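
A simulation can make the contrast between the short and long regressions concrete. In the Python/NumPy sketch below, the parameter values (α = 1, β = 0.5, ρ = 0.4), the normal shocks, the burn-in length, the seed, and the function names are all assumptions made for illustration. With these values Eq. (28.7) has coefficients α(1 − ρ) = 0.6, β + ρ = 0.9, and −ρβ = −0.2.

```python
import numpy as np

def simulate(n, alpha, beta, rho, rng, burn=500):
    """Simulate y_i = alpha + beta*y_{i-1} + eps_i with eps_i = rho*eps_{i-1} + u_i."""
    u = rng.normal(size=n + burn)
    y = np.zeros(n + burn)
    eps = 0.0
    for i in range(1, n + burn):
        eps = rho * eps + u[i]
        y[i] = alpha + beta * y[i - 1] + eps
    return y[burn:]                      # drop the burn-in observations

def ls(y, X):
    """LS coefficient vector."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

alpha, beta, rho = 1.0, 0.5, 0.4         # made-up parameter values
y = simulate(20000, alpha, beta, rho, np.random.default_rng(6))
ones = np.ones(len(y) - 2)

# Short regression: y_i on (1, y_{i-1})
short_reg = ls(y[2:], np.column_stack([ones, y[1:-1]]))
# Long regression: y_i on (1, y_{i-1}, y_{i-2})
long_reg = ls(y[2:], np.column_stack([ones, y[1:-1], y[:-2]]))
```

The long-regression coefficients should be close to (0.6, 0.9, −0.2), while the slope in the short regression settles at a value that is neither β = 0.5 nor β + ρ = 0.9.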


Exercises 


28.1 Suppose that for i = 1, 2,..., 20, the random variables y; are 
independent with E(y,) = a + Bx,, V(y) = o^, where x, = i and o? = 
2. Set this up as a GCR model with n = 20, k = 2. Let b and b* denote 
the LS and GLS estimators of the slope β. Also let σ̂²_b denote the
conventional estimator of the variance of b.


(a) Calculate σ²_b, σ²_{b*}, and E(σ̂²_b).
(b) Comment on the results.


28.2 A researcher believes that the disturbance variance at each obser- 
vation is proportional to the square of the third explanatory variable, 
so she divides each observation through by the third explanatory vari- 
able before running an LS regression. However, in reality, there was no 
heteroskedasticity; the CR model was appropriate for the original data. 
Will her coefficient estimators be unbiased? 


28.3 Suppose that y₁ = θ + ε₁, y₂ = 2θ + ε₂, and y₃ = 3θ + ε₃, where
the parameter θ is unknown, while ε₁, ε₂, and ε₃ are independent with
zero expectations and variances σ₁² = 4, σ₂² = 6, σ₃² = 8. Find the MVLUE
of θ.


28.4 Suppose that for t = 1, 2, . . . , 20, the random variables y_t have
the AR(1) disturbance pattern, with E(y_t) = α + βx_t, V(y_t) = σ² = 2,
ρ = 0.8, and x_t = t. Set this up as a GCR model with n = 20, k = 2. Let
b and b* denote the LS and GLS estimators of the slope β. Also let σ̂²_b
denote the conventional estimator of the variance of b.


(a) Calculate σ²_b, σ²_{b*}, and E(σ̂²_b).
(b) Comment on the results.

t = seqm(a, d, n) is an n × 1 vector whose elements are a, ad,
ad², . . . , ad^{n−1}.


28.5 Suppose that y₁ = θ + ε₁, y₂ = θ + ε₂, y₃ = θ + ε₃, where ε₁ =
u₀ + u₁, ε₂ = u₁ + u₂, ε₃ = u₂ + u₃, while u₀, u₁, u₂, u₃ are independent
N(0, σ²) variables. The parameters θ and σ² are unknown. You are
given one observation on each of the three y's. Determine which of these
two estimators of θ is preferable:


ȳ = (y₁ + y₂ + y₃)/3,    m = (y₁ + y₃)/2.


28.6 In the AR(1) case of the GCR model, the parameter ρ is
interpretable as the population first autocorrelation coefficient of the y's (as
well as of the ε's). So it is proposed to take the sample first
autocorrelation coefficient of the y's, namely the sample correlation of y_i and
y_{i−1}, as an estimate of ρ. Evaluate the proposal.


28.7 Consider the model of Eqs. (28.5)-(28.6), which involves a lagged
dependent variable and autocorrelated disturbances. For convenience,
take the u_i's to be normally distributed. Let E(y_i|y_{i−1}) = α* + β*y_{i−1}.


(a) Find α* and β* in terms of the parameters α, β, ρ, σ².
(b) What parameters would be consistently estimated by a LS linear
regression of y_i on (1, y_{i−1})?


29 Nonlinear Regression 


29.1. Nonlinear CEF's 


We have been concerned with linear CEF's, that is, with populations in
which E(y|x) = x'β is linear in the parameter vector β. Some
"nonlinear" CEF's can be cast in that form, as noted in Section 13.3. But
inherently nonlinear CEF's also arise, that is, populations in which
E(y|x) = h(x, θ), with h(·, ·) being nonlinear in the unknown parameter
vector θ. In such situations, of course, we may run LS linear regression
to estimate the BLP E*(y|x) = x'β, but we now suppose that we are
interested in the CEF itself rather than in the BLP.

For random sampling from the multivariate population, we find that
nonlinear least squares estimators are consistent, and sketch their
asymptotic distribution. (Random sampling is adopted for convenience;
the nonstochastic X case would be treated similarly.) We also discuss
instrumental-variable estimation, and maximum-likelihood estimation
for a binary response model. All this is an extension to the multivariate
case of material developed for the bivariate case in Sections 13.3 and
13.4.

As background, here are some examples of nonlinear regression 
models. 

Cobb-Douglas Production Function. Let y = output, x₁ = labor input,
x₂ = capital input, and suppose that


E(y|x) = θ₁ x₁^{θ₂} x₂^{θ₃}.

In this CEF, the parameters θ₁, θ₂, θ₃ enter nonlinearly.


Linear Regression with AR(1) Disturbances. In Section 28.3, we found
that a linear regression model with AR(1) disturbances implies

E(y_i|z_i) = z_i'γ,

where

z_i = (x_i', x_{i−1}', y_{i−1})',    γ = (γ₁', γ₂', γ₃)',

γ₁ = β,    γ₂ = −βρ,    γ₃ = ρ.


In this CEF, the linear regression parameters γ₁, γ₂, γ₃ are subject to
the nonlinear constraint γ₂ = −γ₁γ₃. Equivalently, the CEF is nonlinear
in the underlying parameters β, ρ.

Binary Response. If $y$ is a binary variable taking on only the values 0 
and 1, then it is implausible that the CEF be linear in the explanatory 
variables. A linear function is unbounded, while $E(y|x) = \Pr(y = 1|x)$ 
is inherently bounded between 0 and 1. A plausible form for the CEF 
is $E(y|x) = F(x'\theta)$, where $F(\cdot)$ is the cdf of some continuous distribution. 
This CEF is nonlinear in the parameter vector $\theta$. 

In the probit model, $F(\cdot)$ is taken to be the standard normal cdf. Here 
is a simple scheme that supports the probit model. Suppose that $y^*$, the 
unobserved propensity to own a car, is determined by a normal linear 
model: 

\[ y^* = x'\theta - \varepsilon, \]

with $\varepsilon \sim N(0, 1)$ independently of $x$. (Writing $-\varepsilon$ rather than $+\varepsilon$ is 
purely a matter of convenience.) Suppose further that $y$, the observed 
binary variable that indicates actual ownership, is determined as 

\[ y = \begin{cases} 1 & \text{if } y^* \ge 0, \\ 0 & \text{if } y^* < 0. \end{cases} \]

Let $A$ be the event that $y = 1$. Now 

\[ A = \{y = 1\} = \{y^* \ge 0\} = \{x'\theta - \varepsilon \ge 0\} = \{\varepsilon \le x'\theta\}. \]

With $\varepsilon \sim N(0, 1)$ independently of $x$, it follows that 

\[ E(y|x) = \Pr(A|x) = F(x'\theta), \]

where $F(\cdot)$ is the standard normal cdf. 

In the probit model, as in many nonlinear regression models, the 
parameter vector $\theta$ does not directly give the "effects" of explanatory 
variables on the conditional expectation of the dependent variable. 
Because $E(y|x) = F(x'\theta)$, we have $\partial E(y|x)/\partial x_j = f(x'\theta)\theta_j$, where $f(\cdot)$ is 
the standard normal pdf. These derivatives vary with $x$: at any value of 
$x$, they are proportional to the coefficients $\theta_j$. 

A popular alternative to the probit model arises if the $\varepsilon$ above has 
the standard logistic, rather than the standard normal, distribution. The 
standard logistic cdf is (see Section 2.3) 

\[ G(\varepsilon) = \exp(\varepsilon)/[1 + \exp(\varepsilon)]. \]

So in the logistic model for binary response, the CEF is $E(y|x) = G(x'\theta)$, 
which differs somewhat from that in the probit model, but is again 
nonlinear in $\theta$. 

Censored Dependent Variable. As the previous example suggests, 
nonlinear CEF's may arise when an underlying dependent variable has a 
linear model, but is not fully observable. Let $y$ = dollars spent on 
purchase of a new car. So $y \ge 0$, and many families will have $y = 0$, 
features that would be incompatible with a normal linear model for $y$. 

A simple scheme that may be appropriate is the Tobit model. Suppose 
that $y^*$, the unobserved propensity to spend on a new car, is determined 
by a normal linear model: 

\[ y^* = x'\theta - \sigma\varepsilon, \]

where $\varepsilon \sim N(0, 1)$ independently of $x$. Suppose further that $y$, the 
observed continuous variable that measures actual expenditure on a 
new car, is determined as 

\[ y = \begin{cases} y^* & \text{if } y^* \ge 0, \\ 0 & \text{if } y^* < 0. \end{cases} \]

Then 

\[ E(y|x) = F(x'\theta/\sigma)\, x'\theta + \sigma f(x'\theta/\sigma), \]

where $f(\cdot)$ and $F(\cdot)$ are the $N(0, 1)$ pdf and cdf. This CEF is nonlinear 
in the parameters $\theta$, $\sigma$. 


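Proof. Let $A$ be the event that $y^* \ge 0$. Now 

\[ A = \{y^* \ge 0\} = \{x'\theta - \sigma\varepsilon \ge 0\} = \{\varepsilon \le x'\theta/\sigma\} = \{\varepsilon \le \tau\}, \]

say, where $\tau = x'\theta/\sigma$. With $\varepsilon \sim N(0, 1)$ independently of $x$, it follows 
that $\Pr(A|x) = F(\tau)$, and that 

\[ E(\varepsilon|x, A) = E(\varepsilon|\varepsilon \le \tau) = \int_{-\infty}^{\tau} t f(t)\, dt \Big/ \int_{-\infty}^{\tau} f(t)\, dt. \]

Now, for the standard normal density, 

\[ f(t) = (2\pi)^{-1/2} \exp(-t^2/2) \;\Rightarrow\; f'(t) = \partial f(t)/\partial t = -t f(t), \]

and of course $\int_{-\infty}^{\tau} f'(t)\, dt = f(\tau)$. So $E(\varepsilon|x, A) = -f(\tau)/F(\tau)$. Further, if $A$ 
does occur, then $y = x'\theta - \sigma\varepsilon$, so 

\[ E(y|x, A) = x'\theta - \sigma E(\varepsilon|x, A) = \sigma\tau - \sigma[-f(\tau)/F(\tau)] = \sigma[\tau + f(\tau)/F(\tau)]. \]

By the Law of Iterated Expectations (T8, Section 5.2), 

\[ E(y|x) = \Pr(A|x) E(y|x, A) + \Pr(\text{not } A|x) E(y|x, \text{not } A) = \Pr(A|x) E(y|x, A), \]

using the fact that $y$ is identically zero if $A$ does not occur. So 

\[ E(y|x) = F(\tau)\sigma[\tau + f(\tau)/F(\tau)] = F(\tau)\sigma\tau + \sigma f(\tau) = F(x'\theta/\sigma)\, x'\theta + \sigma f(x'\theta/\sigma). \quad \blacksquare \]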
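
The Tobit CEF can be checked numerically. The following is a minimal sketch, not part of the derivation: it compares the closed-form expression with a simulated mean of $y = \max(y^*, 0)$. The function names, the number of draws, and the values used for $x'\theta$ and $\sigma$ are assumptions made for the illustration.

```python
import math
import random


def norm_pdf(t):
    """Standard normal pdf f(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def norm_cdf(t):
    """Standard normal cdf F(t), via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def tobit_cef(xb, sigma):
    """Closed form: E(y|x) = F(x'theta/sigma) x'theta + sigma f(x'theta/sigma)."""
    tau = xb / sigma
    return norm_cdf(tau) * xb + sigma * norm_pdf(tau)


def simulated_mean(xb, sigma, draws=200_000, seed=0):
    """Average of y = max(y*, 0), with y* = x'theta - sigma * eps, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        y_star = xb - sigma * rng.gauss(0.0, 1.0)
        total += max(y_star, 0.0)
    return total / draws
```

Calling `tobit_cef(1.0, 2.0)` and `simulated_mean(1.0, 2.0)` should give values that agree up to simulation error.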


29.2. Estimation 


With that as background, we turn to the general case of nonlinear 
regression. Consider a multivariate population in which 

\[ E(y|x) = h(x, \theta), \]

the function $h(\cdot, \cdot)$ being known apart from the $k \times 1$ parameter vector 
$\theta$. (Caution: The number of explanatory variables may differ from the 
number of parameters, as in the AR(1) and Tobit examples above.) 
For estimating this nonlinear CEF, we draw on the analogy principle, 
as we did for the bivariate case in Chapter 13. In the population, the 
CEF is the best predictor of $y$ given $x$. In particular, it is the best 
predictor in the class $h(x, c)$, where $c$ denotes a $k \times 1$ vector. Thus, $\theta$ 
is the value for $c$ that minimizes $E(u^2)$ in the population, where $u = 
y - h(x, c)$. So, given a random sample of observations $(y_i, x_i)$ $(i = 1, 
\ldots, n)$, let us choose, as our estimate of $\theta$, the value of $c$ that has the 
corresponding property in the sample, namely minimizing the sample 
mean squared residual, or equivalently the sample sum of squared 
residuals. 

Proceeding, let $\phi = \phi(c) = \sum_i u_i^2$, where $u_i = y_i - h(x_i, c)$, and choose 
$c = (c_1, \ldots, c_k)'$ to minimize $\phi$. The derivatives are: 

\[ \partial\phi/\partial c_j = \sum_i (\partial u_i^2/\partial c_j) = \sum_i 2u_i(\partial u_i/\partial c_j) = 2\sum_i u_i(-z_{ij}) = -2\sum_i z_{ij}u_i, \]

where 

\[ z_{ij} = \partial h(x_i, c)/\partial c_j. \]

The first-order conditions (FOC's) for a minimum are 

\[ \sum_i z_{ij}u_i = 0 \qquad (j = 1, \ldots, k), \]

or in matrix form, 

\[ Z'u = 0, \]

where $Z = \{z_{ij}\}$ is the $n \times k$ matrix of derivatives of the regression 
function $h$ with respect to the $c$'s, and $u$ is the $n \times 1$ vector of deviations 
of $y$ from $h$. This is a system of $k$ nonlinear equations in $c_1, \ldots, c_k$: the 
$c$'s enter both the $u$'s and the $z$'s nonlinearly. Let $t$ denote the solution 
value for $c$; that is, $\hat{Z}'\hat{u} = 0$, where $\hat{Z}$ and $\hat{u}$ denote $Z$ and $u$ evaluated 
at $c = t$. Provided that this locates the global minimum, we refer to $t$ as 
the nonlinear least squares, or NLLS, estimator of $\theta$. 

The FOC's of NLLS, namely $Z'u = 0$, have a striking resemblance to 
the FOC's of linear LS, namely $X'u = 0$. Indeed, if $h(x, \theta)$ were linear 
in $\theta$, that is, if $h(x, \theta) = x'\theta$, then $z_i = x_i$ and $Z = X$. In that event, $Z$ 
would not involve $c$, and $u = y - Xc$ would be linear in $c$, so the FOC's 
would be linear in $c$, and would be solved analytically to get the familiar 
$c = (X'X)^{-1}X'y$. 

As in Section 13.3, the analogy principle also offers an alternative to 
NLLS estimation, namely instrumental variable, or IV, estimation. In the 
population, the deviations from the CEF have zero expected 
cross-product with each conditioning variable. That is, let $u = y - h(x, c)$; 
then $\theta$ is the value for $c$ that makes $E(x_j u) = 0$ for every $j$. This 
suggests that we choose, as our estimate of $\theta$, the value of $c$ that has 
the corresponding properties in the sample, namely satisfying the 
equations 

\[ \sum_i x_{ij}u_i = 0. \]

Provided that the number of conditioning variables is the same as the 
number of parameters, this is a system of $k$ nonlinear equations in $c_1, 
\ldots, c_k$: the $c$'s enter the $u$'s nonlinearly. If the number of conditioning 
variables is less than $k$, then we may use other functions of the $x_j$, which 
also have zero expected cross-product with $u$ in the population, to 
complete the set of instrumental variables. If the number of 
conditioning variables exceeds $k$, then we may use a subset of them, or seek 
to combine them optimally, as instrumental variables: for discussion and 
references, see Manski (1988, chap. 6). 

Observe that the FOC's of NLLS can be interpreted as choosing $c$ to 
make the sample summed cross-products of $z_j = \partial h(x, c)/\partial c_j$ with $u$ equal 
to zero. So NLLS has an IV interpretation: $z_j$ is a function of $x$ (not of 
$y$), and we know that in the population, the expected cross-product of 
every function of $x$ with $y - h(x, \theta)$ is zero. 


29.3. Computation of the Nonlinear Least Squares Estimator 


The nonlinearity of the FOC's has implications for computing the 
solution and also for the properties of the estimator. Because of the 
nonlinearities, the FOC's are solved numerically rather than analytically. 
Here we sketch an NLLS algorithm for the case in which there is 
a single parameter and a single explanatory variable, for example 
$h(x, \theta) = x^\theta$. Let 

\[ h = h(x, c), \qquad z = \partial h/\partial c = z(x, c), \qquad u = y - h = u(y, x, c). \]

We seek the value of $c$ that makes $z'u = 0$. Let $c^\circ$ be an initial guessed 
value for $c$ and define 

\[ h^\circ = h(x, c^\circ), \qquad z^\circ = \partial h/\partial c^\circ = z(x, c^\circ), \qquad u^\circ = y - h^\circ = u(y, x, c^\circ). \]

The linear approximation to $h$ at the point $c^\circ$ is 

\[ h \approx h^\circ + z^\circ(c - c^\circ), \]

so to that order of approximation, 

\[ u = y - h \approx y - [h^\circ + z^\circ(c - c^\circ)] = u^\circ - z^\circ(c - c^\circ). \]

Applied to all $n$ observations, this says 

\[ u \approx u^\circ - z^\circ(c - c^\circ), \]

where $u$, $u^\circ$, and $z^\circ$ are now $n \times 1$ vectors. So to the same order of approximation, 

\[ \phi(c) = u'u \approx u^{\circ\prime}u^\circ + (c - c^\circ)^2 z^{\circ\prime}z^\circ - 2(c - c^\circ) z^{\circ\prime}u^\circ, \]

\[ \phi'(c) = \partial\phi(c)/\partial c \approx 2(c - c^\circ) z^{\circ\prime}z^\circ - 2 z^{\circ\prime}u^\circ. \]

Set $\phi'(c) = 0$ and solve for 

\[ c - c^\circ = z^{\circ\prime}u^\circ / z^{\circ\prime}z^\circ. \]

Take the resulting $c$ as the new $c^\circ$ and restart the calculation. Continue 
until convergence, that is until $c - c^\circ \approx 0$, where "$\approx 0$" indicates 
satisfaction of a convergence criterion such as being less than 0.0001 in 
absolute value. At that point, $z'u \approx 0$, as desired. 

A few remarks: 

* The derivative $z^\circ = \partial h/\partial c^\circ$ may be approximated numerically as 

\[ [h(x, c^\circ + p) - h(x, c^\circ - p)]/(2p), \]

where $p$ is a (small) step. 

* The expression $z^{\circ\prime}u^\circ / z^{\circ\prime}z^\circ$ may be interpreted as the coefficient in 
LS linear regression of $u^\circ$ on $z^\circ$. 

* The algorithm generalizes in a fairly obvious way to the 
multiparameter case: see Judge et al. (1988, pp. 501-510) or Greene (1990, 
chap. 12), and see also Exercise 29.3. 
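
The one-parameter iteration can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the program referred to in the exercises: the regression function is the example $h(x, \theta) = x^\theta$, the derivative is approximated by the central difference with step `p`, and the function names, starting value, and iteration cap are choices made for the example.

```python
def h(x, c):
    """Example regression function h(x, c) = x**c (requires x > 0)."""
    return x ** c


def z_numeric(x, c, p=1e-6):
    """Central-difference approximation to dh/dc with step p."""
    return (h(x, c + p) - h(x, c - p)) / (2.0 * p)


def gauss_newton(x, y, c0, tol=1e-4, max_iter=100):
    """Iterate c <- c + z'u / z'z until the change is below tol in absolute value."""
    c = c0
    for _ in range(max_iter):
        z = [z_numeric(xi, c) for xi in x]              # derivatives at current c
        u = [yi - h(xi, c) for xi, yi in zip(x, y)]     # residuals at current c
        # LS coefficient in the regression of u on z.
        step = sum(zi * ui for zi, ui in zip(z, u)) / sum(zi * zi for zi in z)
        c += step
        if abs(step) < tol:
            break
    return c
```

With noise-free data generated from $\theta = 0.7$, for instance `x = [1.5, 2.0, 2.5, 3.0, 3.5, 4.0]` and `y = [xi ** 0.7 for xi in x]`, starting from `c0 = 1.0` the iteration should settle near 0.7. A poor starting value can send the iteration astray, which is why practical programs add a step-length search.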


29.4. Asymptotic Properties 


Because the NLLS estimator t is a nonlinear function of y, its sampling 
properties are not readily obtainable as they would be in the linear case. 
Indeed exact results are not available, but asymptotic theory is available 
for random sampling from a multivariate population. For convenience, 
we sketch the one-parameter case, and proceed quite informally. 

For the population, define the random variables 

\[ u = y - h(x, \theta), \qquad z = \partial h(x, \theta)/\partial\theta, \qquad s = zu. \]

Let 

\[ w = -\partial s/\partial\theta = -[z(\partial u/\partial\theta) + u(\partial z/\partial\theta)] = z^2 - u(\partial z/\partial\theta), \]

using $\partial u/\partial\theta = -\partial h(x, \theta)/\partial\theta = -z$. We have 

\[ E(u) = 0, \qquad E(zu) = 0, \qquad E[(\partial z/\partial\theta)u] = 0, \]

because $u$ is the deviation from the CEF while $z$ and $\partial z/\partial\theta$ are functions 
of $x$, not of $y$. So 

\[ E(s) = 0, \qquad V(s) = E(s^2) = E(z^2u^2), \qquad E(w) = E(z^2). \]

Correspondingly, for the sample, define the variables 

\[ \hat{u}_i = y_i - h(x_i, t), \qquad \hat{z}_i = \partial h(x_i, t)/\partial t, \qquad \hat{s}_i = \hat{z}_i\hat{u}_i, \]

where $t$ denotes the solution value that makes $\sum_i \hat{s}_i = 0$. A linear 
approximation at the point $\theta$ gives 

\[ \hat{s}_i \approx s_i + (\partial s_i/\partial\theta)(t - \theta) = s_i - w_i(t - \theta), \]

so 

\[ \sum_i \hat{s}_i = 0 \approx \sum_i s_i - \sum_i w_i (t - \theta). \]

Neglecting the approximation error, we have 

\[ (t - \theta) = \sum_i s_i \Big/ \sum_i w_i = \bar{s}/\bar{w}, \]

say, whence 

\[ \sqrt{n}(t - \theta) = \sqrt{n}\,\bar{s}/\bar{w}. \]

Once again, as in Section 12.3, we see a complicated sample statistic 
exhibited as a ratio of sample means. Here $\bar{s} = (1/n)\sum_i s_i$ is the sample 
mean in random sampling on the random variable $s$. Because $E(s) = 0$, 
the CLT implies 

\[ \sqrt{n}\,\bar{s} \xrightarrow{D} N[0, V(s)]. \]

Further, $\bar{w} = (1/n)\sum_i w_i$ is the sample mean in random sampling on the 
variable $w$, so $\bar{w} \xrightarrow{P} E(w)$ by the LLN. Then the Slutsky Theorem S4 
(Section 9.5) implies 

\[ \sqrt{n}(t - \theta) \xrightarrow{D} N(0, \phi^2), \]

with 

\[ \phi^2 = V(s)/E^2(w) = E(z^2u^2)/E^2(z^2). \]

(Caution: Do not confuse this $\phi$ with the $\phi = \phi(c)$ of Section 29.2.) 
Equivalently, we say that 

\[ t \overset{A}{\sim} N(\theta, \phi^2/n). \]

We see that the NLLS estimator is consistent, although not unbiased. 
In fact, no unbiased estimator of $\theta$ exists in nonlinear regression models. 
In practice, $\phi^2$ will be unknown. The natural estimator is 

\[ \hat{\phi}^2 = \Big(\sum_i \hat{z}_i^2\hat{u}_i^2/n\Big) \Big/ \Big(\sum_i \hat{z}_i^2/n\Big)^2. \qquad (29.1) \]

We can construct confidence intervals and test hypotheses on $\theta$ in the 
usual manner, relying on asymptotic normality. Thus, $t \pm 1.96\,\hat{\phi}/\sqrt{n}$ will 
provide an approximate 95% confidence interval for $\theta$. 
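
As a hedged illustration of Eq. (29.1), the sketch below computes $\hat{\phi}^2$, the implied standard error $\hat{\phi}/\sqrt{n}$, and the approximate 95% interval for a one-parameter model. The function name `nlls_inference` and its arguments (the regression function `h` and its derivative `dh`, both supplied by the user) are assumptions made for the example.

```python
import math


def nlls_inference(x, y, t, h, dh):
    """Estimate phi^2 by Eq. (29.1) and form t +/- 1.96 * phi_hat / sqrt(n).

    t  : NLLS estimate (assumed already computed)
    h  : regression function h(x, c)
    dh : derivative of h with respect to c
    """
    n = len(x)
    z = [dh(xi, t) for xi in x]                      # z-hat_i
    u = [yi - h(xi, t) for xi, yi in zip(x, y)]      # u-hat_i
    numerator = sum((zi * ui) ** 2 for zi, ui in zip(z, u)) / n
    denominator = (sum(zi * zi for zi in z) / n) ** 2
    phi2 = numerator / denominator
    se = math.sqrt(phi2 / n)
    return phi2, se, (t - 1.96 * se, t + 1.96 * se)
```

In the degenerate linear case $h(x, c) = cx$ with $x = (1, 1)$, $y = (1, 3)$, and $t = 2$, the residuals are $(-1, 1)$, so $\hat{\phi}^2 = 1$ and the standard error is $\sqrt{1/2}$.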

For the multiparameter case, the results generalize directly. Let $z = 
\{\partial h(x, \theta)/\partial\theta_j\}$ be the $k \times 1$ vector of derivatives of the CEF with respect 
to its parameters. Then 

\[ \sqrt{n}(t - \theta) \xrightarrow{D} N(0, \Phi), \]

with 

\[ \Phi = [E(zz')]^{-1} E(zz'u^2) [E(zz')]^{-1}. \]

Here the matrix $\sum_i \hat{z}_i\hat{z}_i'/n$ will serve as the estimator for $E(zz')$, while the 
matrix $\sum_i \hat{z}_i\hat{z}_i'\hat{u}_i^2/n$ will serve as the estimator for $E(zz'u^2)$. 

A few remarks to conclude this discussion of NLLS estimation: 

* There is a formal resemblance between the expression for $\Phi$ and 
the variance matrix of the linear LS estimator in the GCR model (Section 
27.2), namely $V(b) = (X'X)^{-1}X'\Sigma X(X'X)^{-1}$, and also the formula for 
the asymptotic variance matrix of the LS estimator of a BLP (Section 
25.5). 

* Suppose that the population conditional variance function is 
constant: $V(y|x) = E(u^2|x) = \sigma^2$, say. Then $E(zz'u^2) = \sigma^2E(zz')$, and $\Phi$ 
simplifies to $\Phi = \sigma^2[E(zz')]^{-1}$. There is a formal resemblance between 
this expression for $\Phi$ and the variance matrix for the linear LS estimator 
in the CR model, namely $V(b) = \sigma^2(X'X)^{-1}$. 

In the present homoskedastic case, the natural estimator for $\Phi$ will 
be 

\[ \hat{\Phi} = \hat{\sigma}^2 \Big(\sum_i \hat{z}_i\hat{z}_i'/n\Big)^{-1}, \]

with $\hat{\sigma}^2 = \sum_i \hat{u}_i^2/n$. Standard computer programs for NLLS are likely to 
incorporate this estimator rather than the more general form. 

* Suppose that in the population, $y|x \sim N[h(x, \theta), \sigma^2]$. Then, provided 
that the distribution of $x$ does not involve $\theta$, the NLLS estimator of $\theta$ 
is also the ML estimator of $\theta$. The argument here is the same as that 
for the linear CNR model (Section 19.2). 


29.5. Probit Model 


For a specific example of nonlinear CEF's, return to the probit model. 
Here $y$ is a binary variable (equal to 1 or 0), and the CEF is 

\[ E(y|x) = \Pr(y = 1|x) = F(x'\theta), \]

with $F(\cdot)$ denoting the $N(0, 1)$ cdf. We can estimate $\theta$ consistently by 
NLLS. But that is not optimal, because heteroskedasticity is present: 
the variance of a binary variable depends on its expectation. A version 
of FGLS might be used, but instead we consider ZES-rule (or ML) 
estimation, which is operational because the form of the conditional 
pmf of the binary variable $y$ has been automatically specified. 
That conditional pmf of $y|x$ is 

\[ [F(x'\theta)]^y [1 - F(x'\theta)]^{1-y}, \]

whose logarithm is 

\[ L = y \log F(x'\theta) + (1 - y) \log[1 - F(x'\theta)]. \]

The scores (derivatives of $L$ with respect to the parameters) are 

\[ s_j = \partial L/\partial\theta_j = (y/F) f x_j - [(1 - y)/(1 - F)] f x_j = f x_j (y - F)/[F(1 - F)] = z_j u, \]

say, where 

\[ z_j = f x_j/[F(1 - F)], \qquad u = y - F, \qquad f = f(x'\theta), \qquad F = F(x'\theta), \]

with $f(\cdot)$ denoting the $N(0, 1)$ pdf. 

The general rule that score variables have expectation zero is easily 
verified here: because $u = y - E(y|x)$ and $z_j$ is a function of $x$ (not of 
$y$) it follows that $E(s_j) = E(z_j u) = 0$ $(j = 1, \ldots, k)$. The ZES-rule 
estimators are the values that make the sample counterparts of the 
expected scores equate to zero. For a sample of $n$ observations $(y_i, x_i')$ 
$(i = 1, \ldots, n)$, let 

\[ s_{ij} = z_{ij}u_i, \]

where 

\[ z_{ij} = x_{ij} f_i/[F_i(1 - F_i)], \qquad u_i = y_i - F_i, \]

with 

\[ f_i = f(x_i'c), \qquad F_i = F(x_i'c). \]

Choose $c$ to make $\sum_i s_{ij} = 0$, that is, to make 

\[ \sum_i z_{ij}u_i = 0 \qquad (j = 1, \ldots, k). \]

In matrix form this is 

\[ Z'u = 0, \]

where $Z = \{z_{ij}\}$ is $n \times k$. This is a system of $k$ nonlinear equations in $c_1, 
\ldots, c_k$: the $c$'s enter both the $u$'s and the $z$'s nonlinearly. Various 
computer programs are available for numerical solution: see Judge et 
al. (1988, pp. 786-795), and also see Exercise 29.3. Provided that a 
global maximum of the log-likelihood is located, the ZES-rule estimator, 
say $t$, is the ML estimator of $\theta$. 
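
One way to solve the probit equations $Z'u = 0$ numerically is the method of scoring, sketched below with NumPy. This is an assumption-laden illustration, not the program referred to in the exercises: the step uses the expected information matrix $\sum_i f_i^2/[F_i(1 - F_i)]\, x_i x_i'$, the starting value is $c = 0$, and the names `probit_ml`, `norm_cdf`, and `norm_pdf` are invented for the example.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def norm_cdf(w):
    """Standard normal cdf, elementwise."""
    return 0.5 * (1.0 + _erf(w / math.sqrt(2.0)))


def norm_pdf(w):
    """Standard normal pdf, elementwise."""
    return np.exp(-0.5 * w ** 2) / math.sqrt(2.0 * math.pi)


def probit_ml(X, y, tol=1e-8, max_iter=50):
    """Solve Z'u = 0 for the probit model by scoring iterations.

    Returns the estimate c and the inverse of the summed expected
    information, an estimate of the variance matrix of c.
    """
    n, k = X.shape
    c = np.zeros(k)
    for _ in range(max_iter):
        w = X @ c
        # Clip F away from 0 and 1 to avoid division by zero.
        F = np.clip(norm_cdf(w), 1e-10, 1.0 - 1e-10)
        f = norm_pdf(w)
        weight = f / (F * (1.0 - F))
        score = X.T @ (weight * (y - F))             # Z'u
        info = (X * (f * weight)[:, None]).T @ X     # expected information
        dc = np.linalg.solve(info, score)
        c = c + dc
        if np.max(np.abs(dc)) < tol:
            break
    return c, np.linalg.inv(info)
```

The square roots of the diagonal of the returned matrix can serve as standard errors. Newton steps based on the actual second derivatives are an alternative to scoring.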

For statistical inference, we may rely on the general theory for 
multiparameter maximum likelihood estimation, namely 

\[ \sqrt{n}(t - \theta) \xrightarrow{D} N(0, \Phi). \]

Here $\Phi = [V(s)]^{-1} = [E(W)]^{-1}$, with $s = \{s_j\} = \{\partial L/\partial\theta_j\}$ being the 
population score vector, and $W = \{w_{hj}\} = -\{\partial s_h/\partial\theta_j\} = -\{\partial^2 L/(\partial\theta_h\partial\theta_j)\}$ 
being the negative of the population second derivative, or Hessian, matrix. The general 
rule that ML estimators are BAN can be verified here by showing that 
the $\Phi$ for ML is less than that for NLLS. The required calculation 
parallels that which shows that GLS is preferable to LS in a linear GCR 
model. 

In practice, the $\Phi$ of ML may be estimated from the sample second 
moment matrix of the estimated score vectors, $\sum_i \hat{s}_i\hat{s}_i'/n$ (which 
estimates $V(s) = \Phi^{-1}$), where the hats 
denote evaluation at $c = t$. Some computer programs (including that in 
Exercise 29.3) will instead evaluate the sample mean second derivatives 
at $t$. For discussion, see Greene (1990, pp. 677-678). 
Maximum likelihood estimation is also straightforward for the logistic 
and Tobit models introduced in Section 29.1: see Maddala (1983, pp. 
99-97, 151-158). 


Exercises 


29.1 Suppose that the logistic, rather than probit, model applies. So 
$E(y|x) = G(x'\theta)$, with $G(a) = \exp(a)/[1 + \exp(a)]$. Show that the 
ZES-rule estimator $c$ satisfies $X'u = 0$, where $u = \{u_i\}$ with $u_i = y_i - G(x_i'c)$. 


29.2 The Tobit model specifies that $y^* \sim N(x'\theta, \sigma^2)$, while the probit 
model specifies that $y^* \sim N(x'\theta, 1)$. Why is the variance set at 1 in the 
probit model? 


29.3 For the SCF data set, let 

\[ y = \begin{cases} 1 & \text{if earnings} > 9 \text{ (thousand dollars)}, \\ 0 & \text{otherwise}, \end{cases} \]

and let the $x$ variables be defined as in Exercise 17.4. Model A for this 
quite artificial binary response example is 

\[ E(y|x) = F(x'\theta), \]

where $x$ consists of $x_1, x_2, x_3, x_6, x_7, x_8$ (that is, the constant, education, 
experience, and the three regional dummies), and $F(\cdot)$ is the standard 
normal cdf. Model B is a shortened version, with only $x_1, x_2, x_3$ included 
as explanatory variables. Estimate both models by maximum likelihood. 


GAUSS Hint: ASG2903 is a simple program for doing maximum likelihood 
estimation of the probit model. The Model-Specific Section of the 
program is specialized to deal with the present probit Model A. When 
you have a different model (probit Model B here, say, or in the future), 
that section has to be changed. 

[Program listing ASG2903 (GAUSS): the listing is not recoverable from this 
copy. The legible fragments show that each iteration computes w = X*c, 
p = cdfn(w), f = pdfn(w), and the log-likelihood; forms the score vector xu 
and a matrix Q = Z'Z; updates c by dc = QI*xu with QI = invpd(Q); and repeats 
until abs(dc) < tol. The criterion reported at each iteration is -2 times the 
log-likelihood, and the final output lists the estimates with standard 
errors se = sqrt(diag(QI)).] 

29.4 For the setup of Exercise 29.3, estimate Models A and B by 
nonlinear least squares. 

GAUSS Hint: ASG2904 is a simple program for doing nonlinear least squares 
estimation. The Model-Specific Section of the program is specialized to 
deal with the present probit Model A. When you have a different model 
(probit Model B here, say, or in the future), that section has to be changed. 

[Program listing ASG2904 (GAUSS): the listing is not recoverable from this 
copy. The legible fragments show the following. The data are 100 family 
heads from the 1963 Survey of Consumer Finances, with y = 1 if earnings 
exceed 9 (thousand dollars). The regression function is h(c) = cdfn(x'c). 
Subroutine GRAD builds the derivative matrix Z column by column by central 
differences with step p = 1e-6. Each iteration forms Q = Z'Z and the step 
dc = QI*Z'u with QI = invpd(Q). Subroutine SEARCH compares the sum of 
squared residuals at c + s*dc and at c + (s/2)*dc, starting from s = 1 and 
halving s (at most mxsrch = 10 times) until the sum of squared residuals 
stops declining. Iterations continue until abs(dc) < tol = 1e-8.] 


29.5 The standard errors computed in the NLLS program ASG2904 
rely on an assumption of homoskedasticity. In binary response models, 
that assumption is automatically violated because 


\[ V(y|x) = E(y|x)[1 - E(y|x)]. \]


Modify the program to produce correct standard errors. 


30 Regression Systems 


30.1. Introduction 


Suppose that we have a two-equation linear regression model. The data 
consist of 

\[ Y = (y_1, y_2), \qquad Z = (x_1, x_2, \ldots, x_k), \]

where each of the vectors is $n \times 1$. The matrix $Z$ is nonstochastic with 
rank$(Z) = k$. The columns of $Y$ are random vectors with 

\[ E(y_1) = X_1\beta_1, \qquad E(y_2) = X_2\beta_2, \]

\[ V(y_1) = \sigma_{11}I, \qquad V(y_2) = \sigma_{22}I, \qquad C(y_1, y_2) = \sigma_{12}I. \]

Here $X_1$ $(n \times k_1)$ and $X_2$ $(n \times k_2)$ are submatrices (possibly overlapping) 
of $Z$, and the $2 \times 2$ matrix 

\[ \Sigma^* = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} \]

is positive definite. 

A CR model applies to $(y_1, X_1)$, and a CR model applies to $(y_2, X_2)$. 
The new feature is the covariance, $\sigma_{12} = \sigma_{21}$, between corresponding 
elements of $y_1$ and $y_2$. 

This is the two-equation case of the regression-system, or 
set-of-regressions, or multivariate regression, or SUR ("seemingly unrelated 
regressions") model that plays a central role in contemporary econometrics. 
For economic settings, consider an input demand system where $y_1$ and 
$y_2$ are the cost shares for labor and capital, while the $x$'s include output 
and input prices. Or consider the reduced form of a simultaneous 
supply and demand system where $y_1$ = quantity, $y_2$ = price, and the $x$'s 
include income, input prices, and prices of substitutes. Or consider 
investment demand $y_1$ and $y_2$ by two firms in an industry, where the 
observations run over time, while $X_1$ includes variables for the first firm, 
and $X_2$ includes variables for the second firm. In all these cases one 
might expect correlation between the two dependent variables at each 
observation. In some cases, $X_1$ and $X_2$ will coincide (each containing all 
$k$ of the $x$'s, so $X_1 = X_2 = Z$); in other cases they will differ. 

For an underlying framework, we suppose that there is a multivariate 
population with joint pdf for the random vectors $y = (y_1, y_2)'$ and $z = 
(x_1, \ldots, x_k)'$. With $x_1$ $(k_1 \times 1)$ and $x_2$ $(k_2 \times 1)$ being (possibly overlapping) 
subvectors of $z$, suppose further that 

\[ E(y|z) = \begin{pmatrix} x_1'\beta_1 \\ x_2'\beta_2 \end{pmatrix}, \qquad V(y|z) = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} = \Sigma^*. \]

So both CEF's are linear and the conditional variances and covariance 
are constant. If we use the stratified sampling scheme, choosing a set 
of vectors $z_i$ $(i = 1, \ldots, n)$, and then drawing independently (over $i$) 
from the bivariate conditional distributions of $y|z$, then the SUR model 
will result. There is also a neoclassical variant of the model, in which 
sampling is random from the joint distribution of $(y', z')$. 

While our discussion will be confined to the two-equation case, the 
model and analysis generalize directly to the case where there are more 
than two equations. 


30.2. Stacking 


The CR specification applies to each regression separately. So 
equation-by-equation LS estimation will be unbiased, and the associated variance 
estimates will be unbiased as well. Those separate LS coefficient vectors 
are 

\[ b_1 = A_1y_1, \qquad b_2 = A_2y_2, \]

where 

\[ A_1 = (Q_{11})^{-1}X_1', \qquad Q_{11} = X_1'X_1, \]
\[ A_2 = (Q_{22})^{-1}X_2', \qquad Q_{22} = X_2'X_2. \]

Clearly, 

\[ E(b_1) = \beta_1, \qquad E(b_2) = \beta_2, \]
\[ V(b_1) = \sigma_{11}(Q_{11})^{-1}, \qquad V(b_2) = \sigma_{22}(Q_{22})^{-1}. \]


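The fact that $C(y_1, y_2)$ is nonzero suggests that better estimates may 
be obtained by estimating the two regressions jointly. To do so, first 
stack the two regressions into one. Let 

\[ y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \qquad X = \begin{pmatrix} X_1 & O \\ O & X_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_{11}I & \sigma_{12}I \\ \sigma_{21}I & \sigma_{22}I \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}. \]

Then $E(y) = X\beta$, $V(y) = \Sigma$ positive definite, $X$ nonstochastic, and $X$ 
has full column rank, so a GCR model applies. The SUR model is just 
a special case of the GCR model, special in the structure of its $\Sigma$ matrix, 
and, at this stage, in the pattern of its $X$ matrix. 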

For some purposes, a more compact display of $\Sigma$ is convenient. If $D$ 
is $m \times m$ and $E$ is $n \times n$, then the Kronecker product of $D$ and $E$, denoted 
$D \otimes E$, is the $mn \times mn$ matrix obtained in blocks by multiplying each 
element of $D$ into the matrix $E$. On that definition, we can write the 
SUR model variance matrix $\Sigma$ as $\Sigma^* \otimes I$, where the $I$ is $n \times n$. 

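In our two-equation case, there are $2n$ observations, $k_1 + k_2$ regression 
coefficients, and 3 distinct elements in the variance matrix $\Sigma$. Consider 
LS regression of the $2n \times 1$ vector $y$ on the $2n \times (k_1 + k_2)$ matrix $X$. 
The estimated coefficient vector is $b = Ay$, where $A = Q^{-1}X'$. Now 

\[ X = \begin{pmatrix} X_1 & O \\ O & X_2 \end{pmatrix}, \qquad X' = \begin{pmatrix} X_1' & O \\ O & X_2' \end{pmatrix}, \]

\[ Q = X'X = \begin{pmatrix} Q_{11} & O \\ O & Q_{22} \end{pmatrix}, \qquad Q^{-1} = \begin{pmatrix} (Q_{11})^{-1} & O \\ O & (Q_{22})^{-1} \end{pmatrix}, \]

\[ A = \begin{pmatrix} (Q_{11})^{-1}X_1' & O \\ O & (Q_{22})^{-1}X_2' \end{pmatrix} = \begin{pmatrix} A_1 & O \\ O & A_2 \end{pmatrix}. \]

So 

\[ b = \begin{pmatrix} A_1 & O \\ O & A_2 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} A_1y_1 \\ A_2y_2 \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix}. \]

We conclude that LS estimation of the stacked regression reduces to 
equation-by-equation LS. 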
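
The stacking result is easy to confirm numerically. The sketch below, which assumes NumPy and uses the hypothetical helper names `ls` and `stack`, builds the block-diagonal $X$ and the stacked $y$, so that LS on the stacked system can be compared with the two separate LS fits.

```python
import numpy as np


def ls(X, y):
    """LS coefficients b = (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def stack(X1, X2, y1, y2):
    """Stacked SUR data: X = blockdiag(X1, X2) and y = (y1', y2')'."""
    n, k1 = X1.shape
    k2 = X2.shape[1]
    X = np.zeros((2 * n, k1 + k2))
    X[:n, :k1] = X1          # northwest block
    X[n:, k1:] = X2          # southeast block
    return X, np.concatenate([y1, y2])
```

For any data with full-column-rank `X1` and `X2`, `ls(*stack(X1, X2, y1, y2))` should equal the concatenation of `ls(X1, y1)` and `ls(X2, y2)` up to rounding.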
30.3. Generalized Least Squares 


Because a GCR model applies, GLS estimation of the stacked regression 
should be preferable to LS. Taking $\Sigma^*$ and hence $\Sigma$ to be known, the 
normal equations for GLS estimation are 

\[ (X'\Sigma^{-1}X)b^* = X'\Sigma^{-1}y, \qquad \text{with } b^* = \begin{pmatrix} b_1^* \\ b_2^* \end{pmatrix}. \]

For GCR models, a data-transformation device is ordinarily used to 
convert a GLS problem into an LS problem (see Sections 27.3 and 27.4). 
But that device is not particularly convenient in the SUR special case, 
so we continue to work directly with the GLS normal equations. 

It is easy to verify that 

\[ \Sigma^{-1} = \begin{pmatrix} \sigma^{11}I & \sigma^{12}I \\ \sigma^{21}I & \sigma^{22}I \end{pmatrix}, \qquad \text{with } \begin{pmatrix} \sigma^{11} & \sigma^{12} \\ \sigma^{21} & \sigma^{22} \end{pmatrix} = \Sigma^{*-1}, \]

illustrating a general rule for Kronecker products: if $D$ and $E$ are 
nonsingular, then $(D \otimes E)^{-1} = D^{-1} \otimes E^{-1}$. 


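Continuing, we have 

\[ X'\Sigma^{-1}y = \begin{pmatrix} X_1' & O \\ O & X_2' \end{pmatrix} \begin{pmatrix} \sigma^{11}I & \sigma^{12}I \\ \sigma^{21}I & \sigma^{22}I \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} = \begin{pmatrix} \sigma^{11}X_1'y_1 + \sigma^{12}X_1'y_2 \\ \sigma^{21}X_2'y_1 + \sigma^{22}X_2'y_2 \end{pmatrix}, \]

\[ X'\Sigma^{-1}X = \begin{pmatrix} \sigma^{11}X_1'X_1 & \sigma^{12}X_1'X_2 \\ \sigma^{21}X_2'X_1 & \sigma^{22}X_2'X_2 \end{pmatrix}. \]

The normal equations may now be solved for $b^*$. 
We do not give the explicit solution here. Instead, we observe that 
the following expressions satisfy the GLS normal equations: 

\[ b_1^* = b_1 - \alpha_1A_1(y_2 - X_2b_2^*), \qquad (30.1) \]
\[ b_2^* = b_2 - \alpha_2A_2(y_1 - X_1b_1^*), \qquad (30.2) \]

where 

\[ \alpha_1 = \sigma_{12}/\sigma_{22}, \qquad \alpha_2 = \sigma_{12}/\sigma_{11}. \qquad (30.3) \]

These expressions show that $b_1^*$ depends on $y_2$ as well as on $y_1$, and 
similarly for $b_2^*$. They also show how knowledge of $\Sigma^*$ up to 
proportionality suffices to compute the GLS estimator. 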

Three special cases are instructive. 

Orthogonal Explanatory Variables. If $X_1'X_2 = O$, then $A_2X_1 = O$, so 

\[ A_2(y_1 - X_1b_1^*) = A_2y_1 = A_2(X_1b_1 + e_1) = A_2e_1, \]

and similarly $A_1y_2 = A_1e_2$, where $e_1$ and $e_2$ are the LS residual vectors. 
So the system (30.1)-(30.2) reduces to 

\[ b_1^* = b_1 - \alpha_1A_1e_2, \qquad b_2^* = b_2 - \alpha_2A_2e_1, \]

which we may write as 

\[ b_1^* = A_1y_1^*, \qquad b_2^* = A_2y_2^*, \]

where 

\[ y_1^* = y_1 - \alpha_1e_2, \qquad y_2^* = y_2 - \alpha_2e_1. \]

This provides a simple algorithm for GLS in the present case: construct 
$y_1^*$ and regress it on $X_1$ to get $b_1^*$; construct $y_2^*$ and regress it on $X_2$ to 
get $b_2^*$. 

Identical Explanatory Variables. If $X_1 = X_2$, then $A_2 = A_1$, so $A_2y_1 = 
A_1y_1 = b_1$ and $A_2X_1 = A_1X_1 = I$, so $A_2(y_1 - X_1b_1^*) = b_1 - b_1^*$. Similarly, 
$A_1(y_2 - X_2b_2^*) = b_2 - b_2^*$. So the system (30.1)-(30.2) reduces to 

\[ b_1^* - b_1 = -\alpha_1(b_2 - b_2^*), \qquad b_2^* - b_2 = -\alpha_2(b_1 - b_1^*). \]

Here the solution is obvious: $b_1^* = b_1$, $b_2^* = b_2$. This is a striking result: 
if the explanatory-variable matrices of the two regressions are identical, 
then $b^* = b$. That is, the GLS and LS estimators will coincide in every 
sample. 

Uncorrelated Disturbances. If $\sigma_{12} = 0$, then $\alpha_1 = \alpha_2 = 0$, and the system 
(30.1)-(30.2) reduces to $b_1^* = b_1$, $b_2^* = b_2$. If the population covariance 
between the two dependent variables is zero, then $b^* = b$. That is, the 
GLS and LS estimators will coincide in every sample. 
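
These properties of the GLS solution can be checked numerically. The sketch below assumes NumPy, treats $\Sigma^*$ as known, and solves the stacked normal equations directly using $\Sigma^{-1} = \Sigma^{*-1} \otimes I$; the function names are invented for the illustration, and forming the $2n \times 2n$ matrix explicitly is done for clarity, not efficiency.

```python
import numpy as np


def ls(X, y):
    """LS coefficients (X'X)^{-1} X'y; ls(X1, v) also computes A1 v."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def sur_gls(X1, X2, y1, y2, sigma_star):
    """GLS for the two-equation SUR model with known 2x2 sigma_star.

    Solves (X' S^{-1} X) b* = X' S^{-1} y, with S^{-1} = inv(sigma_star) kron I_n.
    Returns (b1*, b2*).
    """
    n, k1 = X1.shape
    k2 = X2.shape[1]
    X = np.zeros((2 * n, k1 + k2))
    X[:n, :k1] = X1
    X[n:, k1:] = X2
    y = np.concatenate([y1, y2])
    sigma_inv = np.kron(np.linalg.inv(sigma_star), np.eye(n))
    XtSi = X.T @ sigma_inv
    b_star = np.linalg.solve(XtSi @ X, XtSi @ y)
    return b_star[:k1], b_star[k1:]
```

With the output `b1s, b2s`, Eq. (30.1) can be checked as `b1s == ls(X1, y1) - a1 * ls(X1, y2 - X2 @ b2s)` (up to rounding), where `a1 = sigma_star[0, 1] / sigma_star[1, 1]`. Passing the same matrix for `X1` and `X2`, or a diagonal `sigma_star`, should reproduce equation-by-equation LS.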


30.4. Comparison of GLS and LS Estimators 


The variance matrix of $b^*$ is given by $V(b^*) = (X'\Sigma^{-1}X)^{-1}$. Upon 
inversion, we find that the variance matrix of the GLS estimator of $\beta_2$, which 
lies in the southeast block of $V(b^*)$, can be written as 

\[ V(b_2^*) = \sigma_{22}(X_2'X_2 + \phi^2X_2^{*\prime}X_2^*)^{-1}, \qquad (30.4) \]

where 

\[ \phi^2 = \rho^2/(1 - \rho^2), \qquad \rho = \sigma_{12}/\sqrt{\sigma_{11}\sigma_{22}}, \qquad X_2^* = M_1X_2. \]

Here $\rho$ (which must lie between $-1$ and 1) is the population correlation 
coefficient of $y_1$ and $y_2$. The variance matrix of $b_2^*$ contrasts with the 
variance matrix of $b_2$, the LS estimator of $\beta_2$, which is 

\[ V(b_2) = \sigma_{22}(X_2'X_2)^{-1}. \qquad (30.5) \]

Because $\phi^2 \ge 0$ and the matrix $X_2^{*\prime}X_2^*$ is nonnegative definite, we have 

\[ X_2'X_2 + \phi^2X_2^{*\prime}X_2^* \ge X_2'X_2. \]

Inversion reverses the inequality, so $V(b_2) \ge V(b_2^*)$. The difference will 
be large (that is, GLS will be much more precise than LS) when the 
matrix $\phi^2X_2^{*\prime}X_2^*$ is large relative to the matrix $X_2'X_2$. This occurs when 
$\rho^2$ is large and/or $X_2$ is poorly fitted by LS linear regression on $X_1$. 

Extreme cases are again instructive. Suppose that $X_1$ and $X_2$ are 
orthogonal, that is, $X_1'X_2 = O$. Then $X_2^* = X_2$, and 

\[ X_2'X_2 + \phi^2X_2^{*\prime}X_2^* = (1 + \phi^2)X_2'X_2 = X_2'X_2/(1 - \rho^2), \]

so 

\[ V(b_2^*) = (1 - \rho^2)V(b_2), \]

and a sharp comparison is seen. For example, if $\rho = 0.8$, then $1 - \rho^2 = 
0.36$: the GLS coefficient estimates will have variances that are about 
one-third as large as those of the LS coefficient estimates. Evidently, 
GLS is particularly advantageous when $X_1$ and $X_2$ are orthogonal and 
$\rho^2$ is large. 

At the other extreme, if $X_1$ and $X_2$ are identical, then $X_2^* = O$ and 
$V(b_2^*) = V(b_2)$, as it must because $b_2^* = b_2$ in every sample. Or if $\sigma_{12} = 
0$, then $\phi^2 = 0$ and again $V(b_2^*) = V(b_2)$, as it must because $b_2^* = b_2$ in 
every sample. 

In general, the explanatory variables are not identical and the 
covariance is nonzero, so GLS will be different from, and hence preferable 
to, LS in the SUR model. 
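
The variance comparison in Eqs. (30.4)-(30.5) can be illustrated with a short NumPy sketch. The function names `v_b2_gls` and `v_b2_ls` are assumptions made for the example; $\Sigma^*$ is taken as known.

```python
import numpy as np


def v_b2_gls(X1, X2, sigma_star):
    """Eq. (30.4): V(b2*) = s22 * (X2'X2 + phi^2 X2*'X2*)^{-1}, with X2* = M1 X2."""
    s11, s12, s22 = sigma_star[0, 0], sigma_star[0, 1], sigma_star[1, 1]
    rho2 = s12 ** 2 / (s11 * s22)
    phi2 = rho2 / (1.0 - rho2)
    n = X1.shape[0]
    # M1 = I - X1 (X1'X1)^{-1} X1' is the residual-maker for X1.
    M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
    X2_star = M1 @ X2
    return s22 * np.linalg.inv(X2.T @ X2 + phi2 * X2_star.T @ X2_star)


def v_b2_ls(X2, sigma_star):
    """Eq. (30.5): V(b2) = s22 * (X2'X2)^{-1}."""
    return sigma_star[1, 1] * np.linalg.inv(X2.T @ X2)
```

When the columns of `X1` and `X2` are mutually orthogonal and $\rho = 0.8$, `v_b2_gls` should equal 0.36 times `v_b2_ls`, matching the example in the text.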
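30.5. Feasible Generalized Least Squares 


We proceed to the practical situation, where $\Sigma^*$, and hence $\Sigma$, is 
unknown. The natural estimators of the elements of $\Sigma^*$ come from the 
residual vectors of equation-by-equation LS regression, $e_1$ and $e_2$. Let 

\[ \hat{\sigma}_{hj} = e_h'e_j/n \qquad (h, j = 1, 2), \]

and 

\[ \hat{\Sigma}^* = \begin{pmatrix} \hat{\sigma}_{11} & \hat{\sigma}_{12} \\ \hat{\sigma}_{21} & \hat{\sigma}_{22} \end{pmatrix}, \qquad \hat{\Sigma} = \begin{pmatrix} \hat{\sigma}_{11}I & \hat{\sigma}_{12}I \\ \hat{\sigma}_{21}I & \hat{\sigma}_{22}I \end{pmatrix}, \qquad \hat{\Sigma}^{-1} = \begin{pmatrix} \hat{\sigma}^{11}I & \hat{\sigma}^{12}I \\ \hat{\sigma}^{21}I & \hat{\sigma}^{22}I \end{pmatrix} = \hat{\Sigma}^{*-1} \otimes I. \]

The FGLS estimator 

\[ \hat{b}^* = \begin{pmatrix} \hat{b}_1^* \\ \hat{b}_2^* \end{pmatrix} \]

is the solution to the normal equations $(X'\hat{\Sigma}^{-1}X)\hat{b}^* = X'\hat{\Sigma}^{-1}y$, that is, 

\[ \hat{b}^* = (X'\hat{\Sigma}^{-1}X)^{-1}X'\hat{\Sigma}^{-1}y. \]

Comparison with Eqs. (30.1)-(30.2) shows that the FGLS normal 
equations are solved by: 

\[ \hat{b}_1^* = b_1 - \hat{\alpha}_1A_1(y_2 - X_2\hat{b}_2^*), \]
\[ \hat{b}_2^* = b_2 - \hat{\alpha}_2A_2(y_1 - X_1\hat{b}_1^*), \]

where 

\[ \hat{\alpha}_1 = \hat{\sigma}_{12}/\hat{\sigma}_{22}, \qquad \hat{\alpha}_2 = \hat{\sigma}_{12}/\hat{\sigma}_{11}. \]

So the algebraic analysis of Section 30.3 carries over: FGLS and LS 
coincide if $X_1 = X_2$ or if $\hat{\sigma}_{12} = 0$. Of course, the latter condition will 
occur only by coincidence even if $\sigma_{12} = 0$. 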

This FGLS estimator, introduced by Zellner (1962), is sometimes 
known as ZEF (Zellner efficient). Under general conditions, it has the 
same asymptotic distribution as the GLS estimator. That conclusion 
stems from the quality of the estimator of $\Sigma^*$, which we now explore. 
We know from CR theory that $e_1 = M_1y_1$, with $E(e_1) = 0$ and $V(e_1) = 
\sigma_{11}M_1$. So $E(e_1'e_1) = (n - k_1)\sigma_{11}$, whence $E(\hat{\sigma}_{11}) = \sigma_{11}(1 - k_1/n)$. The bias 
goes to zero as $n$ increases. Equally relevant is the fact that the variance 
goes to zero. This convergence is especially easy to see under normality. 
For then $w_1 = e_1'e_1/\sigma_{11} \sim \chi^2(n - k_1)$: see D6 in Section 21.1. Now, $\hat{\sigma}_{11} = 
(\sigma_{11}/n)w_1$. So 

\[ V(\hat{\sigma}_{11}) = (\sigma_{11}/n)^2V(w_1) = (\sigma_{11}/n)^2\, 2(n - k_1) = (2\sigma_{11}^2/n)(1 - k_1/n), \]

which goes to zero with $n$. Thus $\hat{\sigma}_{11}$ converges in mean square to $\sigma_{11}$, 
so it is a consistent estimator of $\sigma_{11}$. Similarly for $\hat{\sigma}_{22}$. 

Proceeding to the covariance, we have $e_1 = M_1y_1$, and $e_2 = M_2y_2$, so 
by R6 (Section 15.1), 

\[ C(e_1, e_2) = M_1C(y_1, y_2)M_2 = \sigma_{12}M_1M_2, \]

whence 

\[ E(e_1'e_2) = \mathrm{tr}[C(e_1, e_2)] = \sigma_{12}\,\mathrm{tr}(M_1M_2). \]

So 

\[ E(\hat{\sigma}_{12}) = \sigma_{12}\,\mathrm{tr}(M_1M_2)/n. \]

Now 

\[ \mathrm{tr}(M_1M_2) = \mathrm{tr}(I - N_1 - N_2 + N_1N_2) = n - k_1 - k_2 + \mathrm{tr}(N_1N_2). \]

If $X_1'X_2 = O$, then $N_1N_2 = O$ and $\mathrm{tr}(M_1M_2) = n - k_1 - k_2$. At the other 
extreme, if $X_1 = X_2$, then $N_1 = N_2$ and $\mathrm{tr}(M_1M_2) = n - k_1 = n - k_2$. 
Indeed, it can be shown that $n - k_1 - k_2 \le \mathrm{tr}(M_1M_2) \le n - \min(k_1, k_2)$: 
see Theil (1971, pp. 317-322). So $E(\hat{\sigma}_{12})$ goes to $\sigma_{12}$ as $n$ goes to infinity. 
Similarly it can be shown that $V(\hat{\sigma}_{12})$ goes to zero, so $\hat{\sigma}_{12}$ is consistent 
for $\sigma_{12}$. 

The result is that $\hat\Sigma^*$ is consistent for $\Sigma^*$. This quality of the variance-
matrix estimator suffices to make the asymptotic distribution of FGLS
the same as that of GLS. One might divide by the appropriate scalars
rather than by n to get an unbiased estimator of $\Sigma^*$, but there is no
advantage in doing so because asymptotic properties are sought.

Having obtained the FGLS estimate $\hat b^*$, one proceeds to estimate its
variance matrix as

$$\hat V(\hat b^*) = (X'\hat\Sigma^{-1}X)^{-1},$$

and to use this just as one would use $\hat V(b) = \hat\sigma^2 Q^{-1}$ in the CR (or CNR)
model. This practice has an asymptotic justification.

Despite the fact that it is a nonlinear function of y, the ZEF estimator 
is unbiased under quite general conditions, as shown in an elegant 
argument by Kakwani (1967). 


30.6. Restrictions 


In empirical applications of the SUR model, cross-equation restrictions 
are often imposed. For concreteness, suppose that we start with 


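$$E(y_1) = x_1\beta_{11} + x_2\beta_{21},$$

$$E(y_2) = x_1\beta_{12} + x_3\beta_{32}.$$

Upon stacking, this becomes

$$E(y) = E\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}
= \begin{pmatrix} x_1 & x_2 & 0 & 0 \\ 0 & 0 & x_1 & x_3 \end{pmatrix}
\begin{pmatrix} \beta_{11} \\ \beta_{21} \\ \beta_{12} \\ \beta_{32} \end{pmatrix} = X\boldsymbol\beta,$$

say. Consider the cross-equation restriction $\beta_{21} = \beta_{32}$ (= $\beta_0$, say). It
implies that

$$E(y) = \begin{pmatrix} x_1 & x_2 & 0 \\ 0 & x_3 & x_1 \end{pmatrix}
\begin{pmatrix} \beta_{11} \\ \beta_0 \\ \beta_{12} \end{pmatrix} = X_0\boldsymbol\beta_0,$$

say.

Although this $X_0$ matrix does not have the characteristic block-
diagonal SUR pattern, GLS or FGLS estimation can proceed. Observe
that $X_0 = XT$, where

$$T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}.$$
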

So

$$X_0'\Sigma^{-1}X_0 = T'X'\Sigma^{-1}XT, \qquad X_0'\Sigma^{-1}y = T'X'\Sigma^{-1}y,$$

and the restricted GLS estimator of $\boldsymbol\beta_0$ is $(T'X'\Sigma^{-1}XT)^{-1}T'X'\Sigma^{-1}y$.
When such restrictions are imposed, GLS and LS can differ even 
though the original explanatory variables are identical and the distur- 
bances are uncorrelated. The previous conclusion that GLS and LS 
coincide relied in part on the block-diagonal structure of the X matrix. 
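For a sharp example, suppose that $X_1 = X_2$ (= $X^{\circ}$, say), that $\sigma_{12} = 0$,
and that the restriction is $\beta_1 = \beta_2$ (= $\beta_0$, say). Set up the restricted model
as

$$E(y) = \begin{pmatrix} X^{\circ} \\ X^{\circ} \end{pmatrix}\beta_0, \qquad
\Sigma = \begin{pmatrix} \sigma_{11}I & O \\ O & \sigma_{22}I \end{pmatrix}.$$

The LS estimator of $\beta_0$ is

$$b_0 = (2X^{\circ\prime}X^{\circ})^{-1}(X^{\circ\prime}y_1 + X^{\circ\prime}y_2) = (1/2)(b_1 + b_2),$$

while the GLS estimator is

$$b_0^* = (\sigma^{11}X^{\circ\prime}X^{\circ} + \sigma^{22}X^{\circ\prime}X^{\circ})^{-1}(\sigma^{11}X^{\circ\prime}y_1 + \sigma^{22}X^{\circ\prime}y_2)$$
$$= (\sigma^{11} + \sigma^{22})^{-1}(X^{\circ\prime}X^{\circ})^{-1}(\sigma^{11}X^{\circ\prime}y_1 + \sigma^{22}X^{\circ\prime}y_2)$$
$$= \theta b_1 + (1 - \theta)b_2,$$

where

$$\theta = \sigma^{11}/(\sigma^{11} + \sigma^{22}) = \sigma_{22}/(\sigma_{11} + \sigma_{22}).$$

We see that $b_0$ and $b_0^*$ are distinct weighted averages of $b_1$ and $b_2$.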
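
The sharp example can be checked numerically. The Python sketch below is illustrative only: it assumes a single common regressor (so $X^{\circ}$ is n × 1), known variances with $\sigma_{12} = 0$, and made-up parameter values; the helper `dot` and the variable names are assumptions introduced for the example.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))

random.seed(0)
n, beta0 = 200, 1.5                    # illustrative sample size and common slope
s11, s22 = 1.0, 9.0                    # known variances; sigma_12 = 0 by construction
x = [random.gauss(0, 1) for _ in range(n)]              # common regressor
y1 = [beta0 * xi + random.gauss(0, s11 ** 0.5) for xi in x]
y2 = [beta0 * xi + random.gauss(0, s22 ** 0.5) for xi in x]

xx = dot(x, x)
b1, b2 = dot(x, y1) / xx, dot(x, y2) / xx               # equation-by-equation LS

# Restricted LS on the stacked system: (2 x'x)^{-1} (x'y1 + x'y2)
b0_ls = (dot(x, y1) + dot(x, y2)) / (2 * xx)

# Restricted GLS: each equation is weighted by its inverse variance
w1, w2 = 1 / s11, 1 / s22
b0_gls = (w1 * dot(x, y1) + w2 * dot(x, y2)) / (w1 * xx + w2 * xx)

theta = s22 / (s11 + s22)
print(b0_ls, 0.5 * (b1 + b2))                  # equal-weight average
print(b0_gls, theta * b1 + (1 - theta) * b2)   # theta-weighted average
```

With these assumed variances the GLS weights are 0.9 and 0.1, so the noisier second equation counts for much less than in the LS average.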

The rule that identical explanatory variables make GLS coincide with 
LS is still valid on the understanding that once cross-equation restric- 
tions are imposed, the explanatory variables in the two equations should 
no longer be considered identical. After all, linear restrictions on coef- 
ficients are like zero-null-subvector restrictions, as we saw in the single- 
equation context of Section 22.2. 


30.7. Alternative Estimators 


There is a useful way to look at the algebra of the SUR model that may 
clarify the relations among the various estimators. Let 


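$$u_j = y_j - X_jc_j \quad (j = 1, 2), \qquad u = (u_1', u_2')', \qquad U = (u_1, u_2).$$

Then the GLS procedure can be restated as: choose $c = (c_1', c_2')'$ to
minimize the criterion

$$\phi(c) = u'\Sigma^{-1}u
= (u_1', u_2')\begin{pmatrix} \sigma^{11}I & \sigma^{12}I \\ \sigma^{12}I & \sigma^{22}I \end{pmatrix}\begin{pmatrix} u_1 \\ u_2 \end{pmatrix}
= (u_1', u_2')\begin{pmatrix} \sigma^{11}u_1 + \sigma^{12}u_2 \\ \sigma^{12}u_1 + \sigma^{22}u_2 \end{pmatrix}$$

$$= \sigma^{11}u_1'u_1 + \sigma^{12}u_1'u_2 + \sigma^{12}u_2'u_1 + \sigma^{22}u_2'u_2
= \mathrm{tr}(\Sigma^{*-1}U'U),$$

where $\sigma^{ij}$ denotes the (i, j) element of $\Sigma^{*-1}$.
Here U'U is the 2 × 2 matrix of sums of squares and cross-products of
deviations from the regressions. By the same token, FGLS chooses c to
minimize the criterion $\mathrm{tr}(\hat\Sigma^{*-1}U'U)$, while LS chooses c to minimize the
criterion $\mathrm{tr}(U'U) = u'u$.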
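
The equality of the quadratic form and the trace form is easy to confirm numerically. The following Python sketch is illustrative only: the deviations are random numbers rather than regression residuals, the elements of $\Sigma^*$ are made-up values, and the names `Sinv`, `big`, and `UU` are assumptions introduced for the example.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))

random.seed(1)
n = 20
u1 = [random.gauss(0, 1) for _ in range(n)]       # deviations, equation 1
u2 = [random.gauss(0, 1) for _ in range(n)]       # deviations, equation 2

s11, s12, s22 = 2.0, 0.8, 1.0                     # assumed elements of Sigma*
det = s11 * s22 - s12 ** 2
Sinv = [[s22 / det, -s12 / det],
        [-s12 / det, s11 / det]]                  # inverse of Sigma*

# Quadratic form u' Sigma^{-1} u, with Sigma^{-1} = Sigma*^{-1} (x) I_n
# written out as an explicit 2n x 2n array.
u = u1 + u2
big = [[Sinv[r // n][c // n] if r % n == c % n else 0.0
        for c in range(2 * n)] for r in range(2 * n)]
quad = sum(u[r] * big[r][c] * u[c]
           for r in range(2 * n) for c in range(2 * n))

# Trace form tr(Sigma*^{-1} U'U), which needs only the 2 x 2 matrix U'U.
UU = [[dot(u1, u1), dot(u1, u2)],
      [dot(u2, u1), dot(u2, u2)]]
trace = sum(Sinv[i][k] * UU[k][i] for i in range(2) for k in range(2))

ls_criterion = UU[0][0] + UU[1][1]                # tr(U'U) = u'u
print(quad, trace, ls_criterion)
```

The trace form is the practical one: it never builds the 2n × 2n matrix, only the four sums in U'U.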


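Now suppose that we have the normal version of the SUR model.
That is, $y \sim N(\mu, \Sigma)$, where

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \sigma_{11}I & \sigma_{12}I \\ \sigma_{12}I & \sigma_{22}I \end{pmatrix},$$

with $\mu_j = X_j\beta_j$ (j = 1, 2). The pdf of the 2n × 1 random vector y is

$$f(y) = (2\pi)^{-n}|\Sigma|^{-1/2}\exp(-w/2),$$

with

$$w = (y - \mu)'\Sigma^{-1}(y - \mu) = u'\Sigma^{-1}u = \mathrm{tr}(\Sigma^{*-1}U'U).$$

Further, because $\Sigma = \Sigma^* \otimes I_n$, we have $|\Sigma| = |\Sigma^*|^n$. So the sample
log-likelihood function is, apart from an irrelevant constant,

$$\log L = (-1/2)[n \log|\Sigma^*| + \mathrm{tr}(\Sigma^{*-1}U'U)].$$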


If $\Sigma^*$ is known, then this log-likelihood is evidently maximized by
taking, as the estimator of $\beta$, the value of c that minimizes $\mathrm{tr}(\Sigma^{*-1}U'U)$.
But this is exactly the GLS criterion, so in the SUR model with $\Sigma^*$
known, the normal-ML and GLS estimates of $\beta$ are identical. If $\Sigma^*$ is
unknown, normal-ML will choose estimates of the $\sigma$'s along with estimates
of the $\beta$'s. It is straightforward to show that for any choice of c,
the sample log-likelihood function above is maximized with respect to
the $\sigma$'s by taking $\hat\Sigma^* = U'U/n$: compare the single-equation case of
Section 19.2. Inserting that (conditional) solution into the log-likelihood,
we have the "concentrated log-likelihood function"

$$\log L^* = (-n/2)(\log|U'U/n| + 2),$$

which remains to be maximized with respect to c. So in the SUR model
with $\Sigma^*$ unknown, the normal-ML procedure reduces to minimizing
$\log|U'U/n|$ or, for that matter, minimizing $|U'U|$. Along with the LS,
GLS, and FGLS criteria above, the normal-ML criterion is just a scalar
measure of the matrix U'U.

Because the ML and FGLS criteria are different, we should anticipate 
that the resulting estimates will be different in general. Nevertheless, 
the estimators have the same asymptotic distribution. If the explanatory 
variables are identical in the two equations, then normal-ML and FGLS 
(and LS) estimates do coincide in every sample. 

Having obtained the FGLS estimator $\hat b^*$, we might calculate fresh
residuals $e^* = y - X\hat b^*$, and use those residuals to re-estimate $\Sigma^*$. Using
that new estimate of $\Sigma^*$ in place of the original one will generate a new
FGLS estimate, say $\hat b^{**}$. If this process is continued until convergence,
that is, until the successive estimates of $\beta$ stabilize, then the result is
called the iterative FGLS estimator, sometimes known as IZEF (iterative
Zellner efficient). All the successive estimators, including the terminal
one, will share the asymptotic distribution of the GLS estimator $b^*$. If
carried through to convergence, IZEF will solve the FOC's for normal-
ML estimation: the iteration procedure turns out to be an algorithm
for solving the FOC's for minimization of $|U'U|$.
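
The iteration just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a general SUR routine: each equation has a single regressor and no intercept, the data are simulated with made-up parameter values, and the helper names (`dot`, `resid_moments`, `gls_step`, `det_uu`) are invented for the example. Because each step cannot lower the normal likelihood, $|U'U|$ should not increase from LS to FGLS to IZEF.

```python
import random

def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))

def resid_moments(y1, y2, x1, x2, c1, c2):
    """Distinct elements of U'U for trial coefficients (c1, c2)."""
    u1 = [y - c1 * x for y, x in zip(y1, x1)]
    u2 = [y - c2 * x for y, x in zip(y2, x2)]
    return dot(u1, u1), dot(u1, u2), dot(u2, u2)

def gls_step(y1, y2, x1, x2, s11, s12, s22):
    """Solve the 2 x 2 GLS normal equations for a given Sigma*."""
    det = s11 * s22 - s12 ** 2
    i11, i12, i22 = s22 / det, -s12 / det, s11 / det    # inverse of Sigma*
    a, b, d = i11 * dot(x1, x1), i12 * dot(x1, x2), i22 * dot(x2, x2)
    r1 = i11 * dot(x1, y1) + i12 * dot(x1, y2)
    r2 = i12 * dot(x2, y1) + i22 * dot(x2, y2)
    disc = a * d - b * b
    return (d * r1 - b * r2) / disc, (a * r2 - b * r1) / disc

def det_uu(m):
    """Determinant |U'U| from its distinct elements."""
    return m[0] * m[2] - m[1] ** 2

random.seed(2)
n = 40
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + random.gauss(0, 1) for a in x1]
e1 = [random.gauss(0, 1) for _ in range(n)]
e2 = [0.8 * a + 0.6 * random.gauss(0, 1) for a in e1]   # correlated disturbances
y1 = [2.0 * x + e for x, e in zip(x1, e1)]
y2 = [-1.0 * x + e for x, e in zip(x2, e2)]

c = (dot(x1, y1) / dot(x1, x1), dot(x2, y2) / dot(x2, x2))   # LS start
path = [c]
for _ in range(200):
    m = resid_moments(y1, y2, x1, x2, *c)
    new = gls_step(y1, y2, x1, x2, m[0] / n, m[1] / n, m[2] / n)
    path.append(new)
    if max(abs(p - q) for p, q in zip(new, c)) < 1e-12:
        break
    c = new

ls, fgls, izef = path[0], path[1], path[-1]
dets = [det_uu(resid_moments(y1, y2, x1, x2, *p)) for p in (ls, fgls, izef)]
print(ls, fgls, izef)
print(dets)      # |U'U| at LS, FGLS, and the terminal IZEF iterate
```

The first pass through the loop is ordinary FGLS; every later pass re-estimates $\Sigma^*$ from the latest residuals and solves the GLS normal equations again.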


Exercises 


30.1 True or false? In the SUR model, if the explanatory variables in 
the two equations are identical, then the LS residuals from the two 
equations are uncorrelated with each other. 


30.2 True or false? In the SUR model, if the explanatory variables in 
the two equations are orthogonal to each other, then the LS coefficient 
estimates for the two equations are uncorrelated with each other. 




30.3 Suppose that 


$$E(y_1) = x_1\beta_1, \qquad E(y_2) = x_2\beta_2,$$
$$V(y_1) = 4I, \qquad V(y_2) = 5I, \qquad C(y_1, y_2) = 2I.$$

Here $y_1$, $y_2$, $x_1$, and $x_2$ are n × 1, with $x_1'x_1 = 5$, $x_2'x_2 = 6$, $x_1'x_2 = 3$.
Let $b_1$ and $b_1^*$ denote the LS and GLS estimators of $\beta_1$. Calculate $V(b_1)$
and $V(b_1^*)$.


30.4 Suppose that $y_1$ and $y_2$ are bivariate-normally distributed with
unknown expectations $\mu_1$ and $\mu_2$, and known variances and covariance.
Consider random sampling, sample size 100, from that population.


(a) Can we improve on the sample means $\bar y_1$ and $\bar y_2$ as estimators of
$\mu_1$ and $\mu_2$? If so, how? If not, why not?

(b) Now suppose that it is known that $\mu_2 = 2\mu_1$. How would your
answer in (a) change?


30.5 For the two-equation SUR model, suppose that $\sigma_{12} = 0$. Show
that LS is preferable to FGLS, at least for samples of modest size.


30.6 For the two-equation SUR model, derive Eq. (30.4): 
V(B3) = ex (XX, + $^ XPX$) ". 
Hint: Adapt the Submatrix of Inverse Theorem in Section 17.5. 


30.7 Table A.6 contains annual data on two firms, General Electric
(GE) and Westinghouse (WE), for 1935-1954, taken from Theil (1971,
p. 296). The variables are: V1 = Year number (1, ..., 20), V2 = GE
investment, V3 = GE market value, V4 = GE lagged capital stock, V5 =
WE investment, V6 = WE market value, V7 = WE lagged capital stock.
Variables V2-V7 are measured in millions of 1947 dollars. This data
set is presumed to be available as an ASCII file labeled GEWE.

Suppose that the SUR model applies to 


E(yi) = ZB, + 285 + 2383, 


E(ys) = 1.8, + 2.85 + 2565, 


where $y_1$ = V2, $z_1$ = 1, $z_2$ = V3, $z_3$ = V4, $y_2$ = V5, $z_4$ = V6, $z_5$ = V7.
Using this data set, write and run a program to 




(a) Calculate the LS estimates of $\beta_1$ and $\beta_2$, along with their standard
errors.

(b) Using the residuals from those LS regressions, estimate $\Sigma^*$.

(c) Calculate the FGLS estimates of $\beta_1$ and $\beta_2$, along with their
standard errors.


30.8 For the setup of Exercise 30.7, calculate the iterative FGLS esti-
mates. That is, get the residuals from FGLS, use them to re-estimate
$\Sigma^*$, and thus $\Sigma$ and $\beta$. Continue this process until convergence, say
until the successive estimates of $\beta$ coincide to three decimal places.


31 Structural Equation Models 


31.1. Introduction 


Economists find it natural to model economic phenomena as a set of 
simultaneous equations in which several dependent variables are jointly 
determined. It may appear that such models are not only natural, but 
indeed essential: simultaneity, reciprocal causation, and feedback are 
ubiquitous in the real world. Suppose that the validity of LS estimation 
rested on unilateral causation running from the right-hand-side deter- 
mining variables to a left-hand-side dependent variable in a regression 
equation. Then LS estimation would be inappropriate for models of 
joint determination. (For an argument along these lines, see Judge et 
al. 1988, pp. 599—601.) 

But this rationale for special econometric treatment of simultaneous- 
equation models may be questioned on several counts. 

The causal requirement that in regression the x’s have to be the 
variables that actually determine y does not appear in the specification 
of the CR model: nothing in the CR model requires that the x's cause 
y. Indeed it is not obvious why the validity of a conditional expectation 
function (and its estimability by least squares) should depend on the 
assumption that x causes y. For example, suppose that x = father's height
and y = daughter's height are bivariate-normally distributed. Then
$E(y|x) = \alpha + \beta x$, so in random sampling, the LS regression of y on x
will unbiasedly estimate $\alpha$ and $\beta$. But also $E(x|y) = \gamma + \delta y$, so in random
sampling, the LS regression of x on y will unbiasedly estimate $\gamma$ and $\delta$.
Neither of those regressions relies on an assumption of causal direction. 

The fact that a model contains several dependent variables whose 
values are determined jointly cannot be an adequate reason to abandon 
LS. The SUR (regression systems) model of Chapter 30 had several 
dependent variables whose values might be viewed as jointly deter-
mined, and yet its parameters were estimable by LS.

It is sometimes said that the SUR model is not really simultaneous 
because one does not have to solve any equations to get the explicit 
equations for the y's. (From that perspective, the algebraic system $y_1 +
y_2 = 3x$, $y_1 - y_2 = x$ is simultaneous, while its solution set, $y_1 = 2x$, $y_2 =
x$, is not simultaneous.) If so, the notion that a simultaneous-equation
model is required to represent an economic system correctly is tenuous. 
The solution to a system of equations is, after all, logically equivalent to 
the system itself. So, when an economic system can be represented 
correctly by a simultaneous-equation model, it can also be represented 
correctly by the reduced form of that model. If a pair of supply and 
demand equations (simultaneous) correctly represents a market, then 
so does the corresponding pair of quantity and price equations (non- 
simultaneous). 

A sounder case for special treatment of simultaneous-equation models 
can be made by arguing that those models represent situations in which 
the parameters of interest are not the parameters of a CEF (or BLP) 
among observable variables. For such situations, it should be clear that 
LS, which is inherently designed to estimate CEF's (or BLP's), will not 
be an appropriate estimation procedure. 


31.2. Permanent Income Model 


An instructive nonsimultaneous example at this point is Milton Fried-
man's permanent income model of consumption:

$$y = \alpha + \beta z + v, \qquad x = z + u,$$

where y = consumption, x = income, z = permanent income, v =
transitory consumption, u = transitory income. The observed variables
are y and x, while the unobserved variables z, u, v are assumed to have
expectations $\mu$, 0, 0, variances $\sigma_z^2$, $\sigma_u^2$, $\sigma_v^2$, and zero covariances. The
parameters of interest are the slope $\beta$ (which is called the "marginal
propensity to consume out of permanent income") and the intercept $\alpha$
(which is relevant to Friedman's hypothesis that the relation goes
through the origin).

For convenience, suppose that z, u, and v are trivariate-normally
distributed, so that

$$\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} z + u \\ \alpha + \beta z + v \end{pmatrix}$$

is bivariate-normally distributed. Then

$$E(y|x) = \alpha^* + \beta^*x,$$

with

$$\beta^* = \sigma_{xy}/\sigma_x^2, \qquad \alpha^* = \mu_y - \beta^*\mu_x.$$

We calculate

$$\sigma_{xy} = C(z + u, \alpha + \beta z + v) = \beta\sigma_z^2, \qquad \sigma_x^2 = V(z + u) = \sigma_z^2 + \sigma_u^2,$$
$$\mu_y = \alpha + \beta E(z) + E(v) = \alpha + \beta\mu, \qquad \mu_x = E(z) + E(u) = \mu.$$

Let $\theta = \sigma_z^2/(\sigma_z^2 + \sigma_u^2)$. Then

$$\beta^* = \theta\beta, \qquad \alpha^* = \alpha + (1 - \theta)\beta\mu,$$

so

$$E(y|x) = [\alpha + (1 - \theta)\beta\mu] + (\theta\beta)x.$$

Clearly the parameters of interest, namely $\alpha$ and $\beta$, are not the intercept
and slope of this CEF for the observable variables y and x.

If so, it is not surprising that the sample LS regression of y on x,
namely $\hat y = a + bx$, is inappropriate for estimation of $\alpha$ and $\beta$. If we
randomly sample from the joint distribution of x and y, then a NeoCR
model applies, whence $E(a) = \alpha^*$ and $E(b) = \beta^* = \theta\beta$: see Chapter 25.
(The same conclusion follows under a classical, stratified-on-x, sampling
scheme.) This result is often described by saying that LS gives biased
estimators of the structural parameters $\alpha$ and $\beta$. But a fairer description is
that LS gives unbiased estimators of the CEF parameters $\alpha^*$ and $\beta^*$, which
happen to be different from $\alpha$ and $\beta$. The latter description makes it
clear that the issue is not one of estimators, but rather of parameters to
be estimated.

Normality is not crucial to this argument. If the underlying variables
are not joint-normal, then E(y|x) may be nonlinear. But the best linear
predictor E*(y|x) will still be $\alpha^* + \beta^*x$, with $\alpha^*$ and $\beta^*$ as above. In
random sampling, LS linear regression of y on x would consistently
estimate $\alpha^*$ and $\beta^*$, and for that very reason be inconsistent for $\alpha$
and $\beta$.
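
A small simulation makes the point concrete. The Python sketch below is illustrative only: the parameter values are made up, the three unobservables are drawn as independent normals, and the helper `ls_line` is an assumption introduced for the example. With these values $\theta = 0.8$, so the LS slope should settle near $\theta\beta = 0.72$ rather than $\beta = 0.9$.

```python
import random

def ls_line(x, y):
    """Intercept and slope of the LS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(3)
alpha, beta, mu = 0.0, 0.9, 10.0       # structural parameters (illustrative)
sz2, su2, sv2 = 4.0, 1.0, 1.0          # variances of z, u, v (illustrative)
n = 100_000

xs, ys = [], []
for _ in range(n):
    z = random.gauss(mu, sz2 ** 0.5)                      # permanent income
    xs.append(z + random.gauss(0, su2 ** 0.5))            # observed income
    ys.append(alpha + beta * z + random.gauss(0, sv2 ** 0.5))   # consumption

a, b = ls_line(xs, ys)
theta = sz2 / (sz2 + su2)
alpha_star = alpha + (1 - theta) * beta * mu              # CEF intercept
beta_star = theta * beta                                  # CEF slope
print(a, alpha_star)
print(b, beta_star, beta)
```

The regression is doing its job: it recovers the CEF parameters. It is the CEF parameters that differ from the structural ones.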


31.3. Keynesian Model


Next consider a simultaneous example from the same perspective. Take
this stochastic version of the simplest Keynesian model:

(31.1) $y = \alpha + \beta x + u$,

(31.2) $x = y + z$,

where y = consumption, x = income = output, z = investment, and u =
"consumption shock." Equation (31.1) represents the demand for con-
sumption, while (31.2) is the equilibrium condition, which says that
output is equated to the sum of consumption demand and investment
demand. Assume that z and u are random variables with

$$E(z) = \mu, \quad V(z) = \sigma_z^2, \quad E(u) = 0, \quad V(u) = \sigma_u^2, \quad C(z, u) = 0.$$

The zero-covariance assumption captures the idea that z is exogenous.

The understanding is that for given values of the pair (z, u), the
model determines the values of the endogenous variables x and y. (So,
paradoxically, it is the simultaneous-equation model that explicitly incor-
porates one-way causation.) The parameters of interest are $\alpha$ and $\beta$,
and our concern is with whether those are estimable by sample LS
regression of y on x.

The solution for the endogenous variables is made explicit in the
reduced form, which expresses each endogenous variable in terms of the
exogenous variable and the shock:

(31.3) $y = (\alpha + \beta z + u)/(1 - \beta)$,

(31.4) $x = (\alpha + z + u)/(1 - \beta)$.

For convenience suppose that z and u are bivariate normal. Then x and
y are bivariate normal, so

$$E(y|x) = \alpha^* + \beta^*x,$$

with

$$\beta^* = \sigma_{xy}/\sigma_x^2, \qquad \alpha^* = \mu_y - \beta^*\mu_x.$$

From Eqs. (31.3)-(31.4) we calculate

$$\sigma_{xy} = (\beta\sigma_z^2 + \sigma_u^2)/(1 - \beta)^2, \qquad \sigma_x^2 = (\sigma_z^2 + \sigma_u^2)/(1 - \beta)^2,$$
$$\mu_y = (\alpha + \beta\mu)/(1 - \beta), \qquad \mu_x = (\alpha + \mu)/(1 - \beta).$$

Let $\theta = \sigma_z^2/(\sigma_z^2 + \sigma_u^2)$. Then

$$\beta^* = \theta\beta + (1 - \theta), \qquad \alpha^* = \theta\alpha - (1 - \theta)\mu,$$

so

$$E(y|x) = [\theta\alpha - (1 - \theta)\mu] + [\theta\beta + (1 - \theta)]x.$$

Clearly the parameters of interest, namely $\alpha$ and $\beta$, are not the intercept
and slope of this CEF for the observable variables y and x.

If so, it is not surprising that the sample LS regression of y on x,
namely $\hat y = a + bx$, is inappropriate for estimation of $\alpha$ and $\beta$. If we
randomly sample from the joint distribution of x and y, then a NeoCR
model applies, whence $E(a) = \alpha^*$ and $E(b) = \beta^* = \theta\beta + (1 - \theta)$. (The
same conclusion follows if we adopt a classical, stratified-on-x, sampling
scheme.) This result is traditionally described by saying that LS gives
biased estimators of the structural parameters $\alpha$ and $\beta$. But a fairer
description is that LS gives unbiased estimators of the CEF parameters
$\alpha^*$ and $\beta^*$, which happen to be different from $\alpha$ and $\beta$. Again, the issue
is not one of estimation methods, but rather of the parameters that are
the targets of estimation.

Normality is not crucial to this argument. If the underlying variables
are not joint-normal, then E(y|x) may be nonlinear. But the best linear
predictor E*(y|x) will still be $\alpha^* + \beta^*x$, with $\alpha^*$ and $\beta^*$ as above. In
random sampling, LS regression of y on x will consistently estimate the
parameters $\alpha^*$ and $\beta^*$, and for that very reason be inconsistent for $\alpha$
and $\beta$.

Here is a direct way to view the situation without relying on normality.
If we evaluate E(y|x) from Eq. (31.1), we get

$$E(y|x) = \alpha + \beta x + E(u|x).$$

From Eq. (31.4),

$$C(x, u) = C[(\alpha + z + u)/(1 - \beta),\, u] = \sigma_u^2/(1 - \beta).$$

Because x and u are correlated, we see that $E(u|x) \ne E(u) = 0$. So
$E(y|x) \ne \alpha + \beta x$. In the consumption demand equation $y = \alpha + \beta x +
u$, the systematic part, namely $\alpha + \beta x$, is not the CEF of y conditional on
x. In that sense, the structural equation (31.1) is not a regression equa-
tion.
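
The same point can be seen in simulated data. The Python sketch below is illustrative only: the parameter values are made up, z and u are drawn as independent normals, and the helper `ls_line` is an assumption introduced for the example. With these values the LS slope of y on x should settle near $\theta\beta + (1 - \theta) \approx 0.723$, not near $\beta = 0.6$.

```python
import random

def ls_line(x, y):
    """Intercept and slope of the LS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(4)
alpha, beta, mu = 10.0, 0.6, 20.0      # structural parameters (illustrative)
sz2, su2 = 9.0, 4.0                    # variances of z and u (illustrative)
n = 100_000

xs, ys = [], []
for _ in range(n):
    z = random.gauss(mu, sz2 ** 0.5)           # investment (exogenous)
    u = random.gauss(0, su2 ** 0.5)            # consumption shock
    x = (alpha + z + u) / (1 - beta)           # reduced form for income
    xs.append(x)
    ys.append(x - z)                           # consumption, from x = y + z

a, b = ls_line(xs, ys)
theta = sz2 / (sz2 + su2)
beta_star = theta * beta + (1 - theta)         # CEF slope
alpha_star = theta * alpha - (1 - theta) * mu  # CEF intercept
print(a, alpha_star)
print(b, beta_star, beta)
```

Here the LS slope overstates the marginal propensity to consume, because income and the consumption shock move together.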


31.4. Estimation of the Keynesian Model


To estimate the parameters of Eq. (31.1), we may look for a CEF among
observable variables in which $\alpha$ and $\beta$ appear uncontaminated by $\theta$.
Our attention is directed to the reduced-form equation (31.4). We have

$$E(x|z) = [\alpha + z + E(u|z)]/(1 - \beta).$$

But $E(u|z) = E(u) = 0$ since z and u are independent under normality,
so

$$E(x|z) = \pi_1 + \pi_2 z,$$

with

(31.5) $\pi_1 = \alpha/(1 - \beta), \qquad \pi_2 = 1/(1 - \beta)$.

So the systematic part, namely $\pi_1 + \pi_2 z$, of the reduced-form income
equation $x = \pi_1 + \pi_2 z + v$, where $v = u/(1 - \beta)$, is the CEF of x
conditional on z.

Consequently, we should anticipate that LS regression of x on z will
estimate the $\pi$'s. If we randomly sample from the (bivariate-normal)
joint distribution of x and z, then a NeoCR model will apply, so the
sample LS regression of x on z, namely $\hat x = p_1 + p_2 z$, will unbiasedly
estimate $\pi_1$ and $\pi_2$; those estimates will be consistent as well.

How can we convert these estimates of the reduced-form parameters
into estimates of the structural parameters? Evidently, if we knew the
reduced-form parameters $\pi_1$ and $\pi_2$, we could solve Eq. (31.5) to deduce
the values of the structural parameters as

$$\alpha = \pi_1/\pi_2, \qquad \beta = (\pi_2 - 1)/\pi_2.$$

So it is natural to convert the reduced-form estimates into structural-
form estimates via

$$\hat\alpha = p_1/p_2, \qquad \hat\beta = (p_2 - 1)/p_2.$$

This illustrates the indirect least squares, or ILS, method: use LS to estimate
the reduced-form parameters, and then convert into estimates of the
structural-form parameters. The ILS estimates are consistent (via S2,
Section 9.5), although not unbiased (because of the nonlinearity).

An alternative approach to estimating $\alpha$ and $\beta$ runs as follows. In Eq.
(31.1), take expectations conditional on z:

$$E(y|z) = \alpha + \beta E(x|z),$$

using $E(u|z) = E(u) = 0$. Let $x^* = E(x|z) = \pi_1 + \pi_2 z$. Because $x^*$ is a
one-to-one function of z, we can write

$$E(y|x^*) = \alpha + \beta x^*,$$

which is a CEF in which $\alpha$ and $\beta$ are the intercept and slope. If $x^*$ were
observed in our sample, we could regress y on (1, $x^*$) to get unbiased
estimates of $\alpha$ and $\beta$. But $x^*$ is unobservable because $\pi_1$ and $\pi_2$ are
unknown. Still, $x^*$ is estimable as $\hat x = p_1 + p_2 z$, so the suggestion is to
regress y on (1, $\hat x$) to estimate $\alpha$ and $\beta$. This illustrates the two-stage least-
squares, or 2SLS, method: the first stage uses LS to estimate the reduced
form and obtain fitted values; the second stage uses LS on the structural
equation after the fitted values replace the observed values on the right-
hand side. We should anticipate that the 2SLS estimates are consistent,
though not unbiased.

The analogy principle also suggests a third approach. Consider again
the consumption demand equation

(31.1) $y = \alpha + \beta x + u$.

We know that in the population $E(u) = 0$ and $C(z, u) = 0$. That is, $\alpha$
and $\beta$ are the values for $c_1$ and $c_2$ that make $E(u) = 0$ and $E(zu) = 0$,
where now $u = y - (c_1 + c_2x)$. So let us choose as our estimates the
values that make the analogous sample quantities zero. That is, take the
values of $c_1$ and $c_2$ that make $\sum_i u_i = 0$ and $\sum_i z_iu_i = 0$. This illustrates
the instrumental variable, or IV, method. We should anticipate that the IV
estimates are consistent, though not unbiased.

No method for obtaining unbiased estimates of the structural param-
eters $\alpha$ and $\beta$ exists, because there is no CEF among the observable
variables x, y, and z that has $\alpha$ and $\beta$ as its coefficients.
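
The three recipes of this section (ILS, 2SLS, and IV) can be placed side by side in a short Python sketch. It is illustrative only: the data are simulated from the Keynesian model with made-up parameter values, and the helpers `mean` and `ls_line` and the variable names are assumptions introduced for the example.

```python
import random

def mean(v):
    return sum(v) / len(v)

def ls_line(x, y):
    """Intercept and slope of the LS regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

random.seed(5)
alpha, beta, mu, n = 10.0, 0.6, 20.0, 500      # illustrative values
z = [random.gauss(mu, 3.0) for _ in range(n)]  # investment
u = [random.gauss(0, 2.0) for _ in range(n)]   # consumption shock
x = [(alpha + zi + ui) / (1 - beta) for zi, ui in zip(z, u)]   # income
y = [xi - zi for xi, zi in zip(x, z)]          # consumption, from x = y + z

# ILS: LS on the reduced form x = pi1 + pi2*z + v, then convert.
p1, p2 = ls_line(z, x)
a_ils, b_ils = p1 / p2, (p2 - 1) / p2

# 2SLS: replace x by its first-stage fitted value, then LS of y on it.
xhat = [p1 + p2 * zi for zi in z]
a_2sls, b_2sls = ls_line(xhat, y)

# IV: choose (c1, c2) so that sum(u_i) = 0 and sum(z_i * u_i) = 0.
mz, mx, my = mean(z), mean(x), mean(y)
b_iv = (sum((zi - mz) * (yi - my) for zi, yi in zip(z, y))
        / sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x)))
a_iv = my - b_iv * mx

print(a_ils, a_2sls, a_iv)
print(b_ils, b_2sls, b_iv)
```

In this sketch the three sets of estimates agree up to rounding error. With a single instrument z and the identity x = y + z holding at every observation, the three recipes are different routes to the same numbers.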


31.5. Structure versus Regression 


We have just examined two situations where the parameters of interest
are not the coefficients of a CEF among observable variables. Why then
are the parameters interesting? Following Marschak (1953), we develop
a rationale that takes prediction as the ultimate goal of the research.
For the permanent income model, recall that

$$\beta^* = \theta\beta, \qquad \alpha^* = \alpha + (1 - \theta)\beta\mu,$$

where

$$\theta = \sigma_z^2/(\sigma_z^2 + \sigma_u^2).$$

Why is knowledge of $\alpha^*$ and $\beta^*$ in the CEF $E(y|x) = \alpha^* + \beta^*x$ not
adequate? If our objective is to predict y given x, and there has been no
change in the population, then knowledge of $\alpha^*$ and $\beta^*$ would indeed
suffice. But suppose that the population has changed, and that the new
population is the same as the old one, except for a change in the variance
of transitory income. That is, $\alpha$, $\beta$, $\mu$, and $\sigma_z^2$ are the same, but $\sigma_u^2$ is
different. Then $\theta$, $\alpha^*$, and $\beta^*$ will all be different. Unless we have
estimated the constituent parts of $\alpha^*$ and $\beta^*$ (those parts being the
structural parameters), we will be ignorant of the new CEF parameters.

For the Keynesian model, recall that

$$\beta^* = \theta\beta + (1 - \theta), \qquad \alpha^* = \theta\alpha - (1 - \theta)\mu,$$

where

$$\theta = \sigma_z^2/(\sigma_z^2 + \sigma_u^2).$$

Why is knowledge of $\alpha^*$ and $\beta^*$ in the CEF $E(y|x) = \alpha^* + \beta^*x$ not
adequate? If our objective is to predict y given x, and there has been no
change in the population, then knowledge of $\alpha^*$ and $\beta^*$ would suffice.
But suppose that the population has changed, and that the new popu-
lation is the same as the first, except for a change in the variance of
investment. That is, $\alpha$, $\beta$, $\mu$, and $\sigma_u^2$ are the same, but $\sigma_z^2$ is different.
Then $\theta$, $\alpha^*$, and $\beta^*$ will be different. Unless we have estimated the
constituent parts of $\alpha^*$ and $\beta^*$ (those parts being the structural param-
eters), we will be ignorant of the new CEF parameters.
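
The arithmetic behind these two thought experiments is short enough to tabulate. The Python sketch below is illustrative only: the function names and all numerical values are assumptions introduced for the example. Each function maps structural parameters into the CEF intercept and slope, so the new CEF can be computed once one variance is changed.

```python
def cef_permanent_income(alpha, beta, mu, sz2, su2):
    """CEF (intercept, slope) of y on x in the permanent income model."""
    theta = sz2 / (sz2 + su2)
    return alpha + (1 - theta) * beta * mu, theta * beta

def cef_keynesian(alpha, beta, mu, sz2, su2):
    """CEF (intercept, slope) of y on x in the Keynesian model."""
    theta = sz2 / (sz2 + su2)
    return theta * alpha - (1 - theta) * mu, theta * beta + (1 - theta)

# Permanent income: the variance of transitory income rises from 1 to 4.
pi_old = cef_permanent_income(alpha=0.0, beta=0.9, mu=10.0, sz2=4.0, su2=1.0)
pi_new = cef_permanent_income(alpha=0.0, beta=0.9, mu=10.0, sz2=4.0, su2=4.0)

# Keynesian: the variance of investment falls from 9 to 4.
k_old = cef_keynesian(alpha=10.0, beta=0.6, mu=20.0, sz2=9.0, su2=4.0)
k_new = cef_keynesian(alpha=10.0, beta=0.6, mu=20.0, sz2=4.0, su2=4.0)

print(pi_old, pi_new)
print(k_old, k_new)
```

Someone who knew only the old CEF coefficients could not make the second call in each pair; someone who knew the structural parameters can.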

To say that a set of parameters is structural is to claim that it is 
plausible that one of them will change while the rest of them remain 
invariant. It is a claim about the world rather than about the algebra or 
econometrics. 

From that perspective, a conditional expectation function may or may 
not be structural. The relevance of this remark is not confined to multi- 
equation models. 

Consider a perennial question: Which is the correct regression in a 
bivariate distribution? We have argued (Section 16.1) that both E(y|x) 
and E(x|y) may be legitimate targets of interest. But one target may be 
more interesting than the other. Suppose, for example, that $f_1(x)$, the
marginal pdf of x, will change with no change in $g_2(y|x)$, the conditional
pdf of y given x. Then E(y|x) will remain the same while E(x|y) will not.
If this is how the world works, then E(y|x) will be structural, while E(x|y)
will not. For example, suppose that y = daughter's height, and x =
father's height. Suppose that you are asked to adapt results for the
population at large to the subpopulation consisting of families in which
the father has played professional basketball. Which CEF do you antic-
ipate will be unchanged from the population to the subpopulation?

Consider another question: What is the objection to running a short
regression? Won't that estimate a CEF in its own right? Suppose that

$$E(y|x_1, x_2) = \beta_0 + \beta_1x_1 + \beta_2x_2.$$

It is often said that it is wrong to omit $x_2$ and run the short regression
of y on $x_1$ alone. But it is quite possible that

$$E(y|x_1) = \beta_0^* + \beta_1^*x_1$$

is also correct. It is certainly correct, with $\beta_1^* = \beta_1 + \phi\beta_2$, if the CEF of
$x_2$ on $x_1$ in the population is linear, with slope $\phi$. Nevertheless, consid-
erations of structure may dictate a preference for the long CEF. Suppose
that the world changes because the joint pdf of $x_1$ and $x_2$ changes, with
no change in the conditional pdf of y given $x_1$ and $x_2$. Then the condi-
tional pdf of y given $x_1$ will change. In particular, a change in $\phi$ will,
with $\beta_1$ and $\beta_2$ invariant, produce a change in $\beta_1^*$. Unless we have
estimated the constituent parts of $\beta_1^*$ (the structural parameters), we
will be ignorant of the new CEF for y given $x_1$. In a sense, the original
short regression is not wrong; it is just inadequate.
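
The dependence of the short-regression slope on $\phi$ can be illustrated by simulation. The Python sketch below is illustrative only: the long-regression coefficients, the two values of $\phi$, and the function names `ls_slope` and `short_slope` are assumptions introduced for the example. The long CEF is held fixed while the relation between $x_2$ and $x_1$ is changed.

```python
import random

def ls_slope(x, y):
    """Slope of the LS regression of y on x (with a constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def short_slope(phi, beta1=1.0, beta2=2.0, n=100_000, seed=0):
    """Simulate the long model, then run the short regression of y on x1."""
    rng = random.Random(seed)
    x1, y = [], []
    for _ in range(n):
        a = rng.gauss(0, 1)
        b = phi * a + rng.gauss(0, 1)            # E(x2 | x1) = phi * x1
        x1.append(a)
        y.append(0.5 + beta1 * a + beta2 * b + rng.gauss(0, 1))
    return ls_slope(x1, y)

old = short_slope(phi=0.5)     # population short slope: 1 + 0.5 * 2 = 2
new = short_slope(phi=-0.5)    # population short slope: 1 - 0.5 * 2 = 0
print(old, new)
```

Both short regressions estimate a legitimate CEF. Neither tells us, by itself, what the other would be.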

It should not be presumed that a long regression is inevitably more
structural than a short regression. Suppose that a SUR model applies
to

(31.6) $E(y_1|z) = z'\beta_1, \qquad E(y_2|z) = z'\beta_2,$
$V(y_1|z) = \sigma_{11}, \qquad V(y_2|z) = \sigma_{22}, \qquad C(y_1, y_2|z) = \sigma_{12}.$

If the distribution of the random vector $y = (y_1, y_2)'$ conditional on z is
bivariate normal, then

(31.7) $E(y_2|z, y_1) = z'\beta_2^* + \theta y_1,$

with

$$\theta = \sigma_{12}/\sigma_{11}, \qquad \beta_2^* = \beta_2 - \theta\beta_1.$$

If the conditional variance of $y_1$ given z changes with no change in the
conditional expectations of $y_1$ and $y_2$ given z, then the long regression
(Eq. 31.7) will change while the short regression in Eq. (31.6) remains
invariant.

To recapitulate: even if we are interested only in prediction (that is, 
in CEF's), if our interest includes predictions for populations other than 
the one from which our data came, we may well need to isolate structural 
parameters, those that may change individually. The argument may be 
reinforced if we recall an introductory microeconomics course, and ask 
why the determination of price and quantity is modeled in terms of 
separate demand and supply functions, rather than directly in terms of 
the exogenous variables (income, input prices, prices of substitutes). 
The answer is that we want to study what happens when the demand 
function shifts, while the supply function remains the same (or vice 
versa). 

Finally, we might concede some validity to the notion that causality 
is a requirement for regression models. To the extent that causal- 
ity is interpreted as "structural-ness," we may well agree that causality 
is needed to support interest in the parameters of a regression, while 
maintaining that it is not needed to support estimation of the parameters 
of a regression. 


Exercises 


31.1 Consider the permanent income model of Section 31.2. 


(a) Suppose it is known that consumption is proportional to perma- 
nent income, in the sense that a = 0. Propose a simple estimator 
of B that is consistent under random sampling. 

(b) Alternatively, suppose that we observe not only y and x, but also
x' = z + u', where u' has zero expectation and is uncorrelated
with z, u, and v. Propose a simple estimator of $\beta$ that is consistent
under random sampling. Hint: Find C(x', x).


31.2 In the Keynesian model of Section 31.3, show that the ILS, 2SLS, 
and IV estimates of В are identical. Hint: x = y + z at every observation. 


31.3 Suppose that the endogenous variables q = quantity and p =
price are jointly determined by this simultaneous-equation model:

Demand. $q = 30 - 2p + u$,
Supply. $q = 20 + p + v$,

in which the disturbances u and v are independent normal variables
with zero expectations and variances $\sigma_u^2 = 5$, $\sigma_v^2 = 10$. Consider random
sampling from the joint probability distribution of quantity and price.
Let b denote the slope in the LS regression of quantity on price (with a
constant term included). Calculate E(b).


31.4 Consider the two-equation model:

$$y_1 = \alpha_1x + u_1, \qquad y_2 = \alpha_2y_1 + u_2,$$

where x, $u_1$, $u_2$ are independent N(0, 1) variables.


(a) Calculate $E(y_2|x)$ and $V(y_2|x)$.

(b) Calculate $E(y_2|x, y_1)$.

(c) In random sampling, will LS regression of $y_2$ on $y_1$ give an
unbiased estimator of $\alpha_2$? Explain.


31.5 Suppose that $y_1 = x + u_1$, $y_2 = 2y_1 + u_2$, where x, $u_1$, and $u_2$ are
trivariate normal with zero expectations, unit variances, and $C(u_1, u_2) =
1/2$, $C(x, u_1) = 0 = C(x, u_2)$. Consider random sampling, sample size 50,
from the joint distribution of x, $y_1$, $y_2$.


(a) Let р be the slope in the LS regression of y, on x. Find Е(р). 
(b) Let b be the slope in the LS regression of y, on y,. Find E(b). 


31.6 Suppose that

$$y_1 = \alpha_1y_2 + \alpha_2x + u_1, \qquad y_2 = \alpha_3y_1 + u_2,$$

where x, $u_1$, $u_2$ have a trivariate normal distribution, with

$$E(x) = E(u_1) = E(u_2) = 0,$$
$$V(x) = 3, \qquad V(u_1) = V(u_2) = 2,$$
$$C(x, u_1) = C(x, u_2) = 0, \qquad C(u_1, u_2) = 1,$$
$$\alpha_1 = 1, \qquad \alpha_2 = 2, \qquad \alpha_3 = -3.$$

Consider random sampling, sample size 50, from the joint distribution
of x, $y_1$, $y_2$. Let b be the slope in the sample LS linear regression of $y_1$
on $y_2$. Find E(b).


31.7 The structural form of a model is

$$y_1 = \alpha_1y_2 + \alpha_2x_1 + u_1,$$
$$y_2 = \alpha_3y_1 + \alpha_4x_2 + \alpha_5x_3 + u_2,$$

where $x_1$, $x_2$, $x_3$ are independent (0, 1) variables, $u_1$, $u_2$ are independent
(0, 3) variables, and $\alpha_1 = -2$, $\alpha_2 = 2$, $\alpha_3 = 2$, $\alpha_4 = 4$, $\alpha_5 = 5$. The
exogenous variables (the x's) are independent of the structural shocks
(the u's).


(a) Find the pair of reduced-form equations. Include expressions for
the reduced-form disturbances in terms of the structural distur-
bances.

(b) Find the variances and covariances for the x's and the y's. Display
your results as a matrix V(z), where $z = (x_1, x_2, x_3, y_1, y_2)'$.

(c) Let $E(y_1|y_2, x_1) = \alpha_1^*y_2 + \alpha_2^*x_1$. Find the $\alpha^*$'s.

(d) Discuss the qualitative relation between the $\alpha^*$'s and the $\alpha$'s.


32  Simultaneous-Equation Model 


32.1. A Supply-Demand Model 


In this chapter, we develop a specification that may be appropriate for
linear simultaneous-equation models. But first, to fix ideas and intro-
duce notation, we consider a two-equation system in which the endog-
enous variables $y_1$ = quantity and $y_2$ = price are determined by the
exogenous variables $x_1$ = income, $x_2$ = wage rate, and $x_3$ = interest rate,
and the disturbances $u_1$ = demand shock, $u_2$ = supply shock. For
convenience we suppress intercepts in both equations. The structural
form of the model is

(32.1) Demand. $y_1 = \alpha_1y_2 + \alpha_2x_1 + u_1$,
(32.2) Supply. $y_2 = \alpha_3y_1 + \alpha_4x_2 + \alpha_5x_3 + u_2$.

Taking the terms in $y_1$ and $y_2$ over to the left-hand side and adopting a
matrix representation we have

$$(y_1, y_2)\begin{pmatrix} 1 & -\alpha_3 \\ -\alpha_1 & 1 \end{pmatrix}
= (x_1, x_2, x_3)\begin{pmatrix} \alpha_2 & 0 \\ 0 & \alpha_4 \\ 0 & \alpha_5 \end{pmatrix} + (u_1, u_2),$$

or

$$y'\Gamma = x'B + u'.$$

In the structural-form coefficient matrices $\Gamma$ and B, the columns refer
to equations, while the rows refer to variables.

Solve for each endogenous variable in terms of exogenous variables 
and structural shocks to get the reduced form of the model: 

(32.3) Quantity. $y_1 = \pi_{11} x_1 + \pi_{21} x_2 + \pi_{31} x_3 + v_1$, 
(32.4) Price. $y_2 = \pi_{12} x_1 + \pi_{22} x_2 + \pi_{32} x_3 + v_2$. 

In matrix form, we have 

$$(y_1, y_2) = (x_1, x_2, x_3) \begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \\ \pi_{31} & \pi_{32} \end{pmatrix} + (v_1, v_2),$$

or 

$$y' = x'\Pi + v'.$$

To be more explicit: we solved by post-multiplying the structural form 
through by $\Gamma^{-1}$ to get 

$$y' = x'B\Gamma^{-1} + u'\Gamma^{-1} = x'\Pi + v'.$$

Here $\Pi = B\Gamma^{-1}$ is the reduced-form coefficient matrix, and $v' = u'\Gamma^{-1}$ 
is the reduced-form disturbance vector. In $\Pi$, the columns refer to 
equations; the rows refer to variables. 

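For our supply-demand model we have 

$$\Gamma^{-1} = (1/\Delta) \begin{pmatrix} 1 & \alpha_3 \\ \alpha_1 & 1 \end{pmatrix}, \quad \text{with } \Delta = 1 - \alpha_1\alpha_3.$$

So 

$$\Pi = \begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \\ \pi_{31} & \pi_{32} \end{pmatrix} = (1/\Delta) \begin{pmatrix} \alpha_2 & \alpha_2\alpha_3 \\ \alpha_1\alpha_4 & \alpha_4 \\ \alpha_1\alpha_5 & \alpha_5 \end{pmatrix},$$

$$v' = (v_1, v_2) = (u_1 + \alpha_1 u_2, \; \alpha_3 u_1 + u_2)/\Delta.$$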
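
The passage from structural form to reduced form can be checked numerically. The following is a minimal sketch, not part of the original derivation: the coefficient values are assumed purely for illustration, and the helper names `matmul` and `inv2` are our own. It forms $\Pi = B\Gamma^{-1}$ by matrix arithmetic and compares the result with the closed-form expression in terms of $\Delta = 1 - \alpha_1\alpha_3$.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inv2(g):
    """Invert a 2 x 2 matrix; raises ZeroDivisionError if it is singular."""
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [[g[1][1] / det, -g[0][1] / det],
            [-g[1][0] / det, g[0][0] / det]]

# Illustrative (assumed) values for alpha_1, ..., alpha_5.
a1, a2, a3, a4, a5 = F(-1, 2), F(1), F(1, 2), F(2), F(3)

Gamma = [[F(1), -a3], [-a1, F(1)]]          # columns refer to equations
B = [[a2, F(0)], [F(0), a4], [F(0), a5]]    # rows refer to x1, x2, x3

Pi = matmul(B, inv2(Gamma))                 # Pi = B Gamma^{-1}

# Closed form, with Delta = 1 - alpha_1 * alpha_3.
Delta = 1 - a1 * a3
Pi_closed = [[a2 / Delta, a2 * a3 / Delta],
             [a1 * a4 / Delta, a4 / Delta],
             [a1 * a5 / Delta, a5 / Delta]]

print(Pi == Pi_closed)
```

Exact rational arithmetic is used so that the two matrices can be compared for equality rather than within a rounding tolerance.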

To recapitulate: we began with a structural form that consisted of 
two linear equations relating the two endogenous variables, three exogenous 
variables, and two structural disturbances. We derived the 
reduced form, which consists of two linear equations, each of which 
expresses one endogenous variable as a linear function of the exogenous 
variables and a reduced-form disturbance (which in turn is a linear 
function of the structural disturbances). All exogenous variables and 
structural disturbances appeared in each reduced-form equation, 
although they did not all appear in each structural-form equation. 


32.2. Specification of the Simultaneous-Equation Model 


Our statistical specification for the linear simultaneous-equation model, or 
SEM, starts with a multivariate population. Suppose that the joint distribution 
of the $m \times 1$ endogenous-variable vector $y$, the $k \times 1$ exogenous-variable 
vector $x$, and the $m \times 1$ structural-disturbance vector $u$, 
has these properties: 


(A1) $y'\Gamma = x'B + u'$, 

(A2) $\Gamma$ nonsingular, 

(A3) $E(u \mid x) = 0$, 

(A4) $V(u \mid x) = \Sigma^*$ positive definite. 


Here $\Gamma$ is $m \times m$, $B$ is $k \times m$, $\Sigma^*$ is $m \times m$. Assumption (A1) gives the 
system of m structural equations in m endogenous variables. Assumption 
(A2) says that the system is complete, in the sense that y is uniquely 
determined by x and u. Assumption (A3) says that x is exogenous, in 
the sense that the conditional expectation of the structural shock vector 
is the same for all values of x. Assumption (A4) is a homoskedasticity 
requirement; positive definiteness simply rules out situations where 
there is an exact linear dependency among the structural disturbances. 

In some variants of the SEM it is assumed that $u \mid x$ is multinormal, 
in others that $u$ and $x$ are stochastically independent. For some purposes, 
a weaker exogeneity condition, namely $C(x, u) = O$, suffices. 

The specification in (A1)-(A4), when supplemented by a sampling 
scheme, will constitute our SEM. The following implications are imme- 
diate: 


(B1) $y' = x'\Pi + v'$, 

with 

(B2) $\Pi = B\Gamma^{-1}$, $v' = u'\Gamma^{-1}$. 


Here Eq. (B1) is the reduced form, in which each endogenous variable 
is expressed as a linear function of the exogenous-variable vector $x$ and 
the reduced-form disturbance vector $v$, the latter being a linear function 
of $u$. Linear function rules applied to Eqs. (A3)-(A4) imply 

(B3) $E(v \mid x) = 0$, 

(B4) $V(v \mid x) = \Omega^*$, 

with 

(B5) $\Omega^* = (\Gamma^{-1})'\Sigma^*\Gamma^{-1}$ positive definite. 


So the reduced-form disturbance vector $v$ is mean-independent of, and 
homoskedastic with respect to, the exogenous variable vector $x$. Then 
it follows from Eq. (B1) that 


(B6) $E(y' \mid x) = x'\Pi$, 
(B7) $V(y \mid x) = \Omega^*$. 


Equation (B6) says that the systematic part of the reduced form consti- 
tutes a set of conditional expectation functions, and (B7) says that the 
conditional variance function is constant. If u|x is multinormal, then 
y|x will also be multinormal. 

Mean-independence implies uncorrelatedness, so 


(B8) $C(x, u) = O$, 

(B9) $C(x, v) = O$. 

In conjunction with Eqs. (B1), (B4), and (B5), these imply that 

(B10) $C(y, v) = C(\Pi'x + v, v) = V(v) = \Omega^*$, 

(B11) $C(y, u) = C(y, \Gamma'v) = C(y, v)\Gamma = \Omega^*\Gamma = (\Gamma^{-1})'\Sigma^*$. 


In Eq. (B11) all elements of $(\Gamma^{-1})'\Sigma^*$ may be nonzero, so each endogenous 
variable may be correlated with every structural disturbance. And 
with $C(y, u) \neq O$, we have 


(B12) $E(u \mid y) \neq E(u) = 0$. 


The contrast between Eq. (A3) and Eq. (B12) is critical: the structural 
disturbances are mean-independent of the exogenous variables, but not 
of the endogenous variables. (In weaker form, the contrast is between 
Eq. B8 and Eq. B11: the structural disturbances are uncorrelated with 
x, but not with y.) 

To illustrate, suppose that the SEM applies to our supply-demand 
model. The structural form is 

(32.1) Demand. $y_1 = \alpha_1 y_2 + \alpha_2 x_1 + u_1$, 
(32.2) Supply. $y_2 = \alpha_3 y_1 + \alpha_4 x_2 + \alpha_5 x_3 + u_2$. 

Taking expectations conditional on $y_2$ and $x_1$ in Eq. (32.1), we get 

$$E(y_1 \mid y_2, x_1) = \alpha_1 y_2 + \alpha_2 x_1 + E(u_1 \mid y_2, x_1),$$

in which the last term is not equal to $E(u_1) = 0$. If it were, then 
$C(y_2, u_1)$ would be 0. But 

$$C(y_2, u_1) = C(v_2, u_1) = (1/\Delta)\,C(\alpha_3 u_1 + u_2, u_1) = (\alpha_3\sigma_{11} + \sigma_{12})/\Delta$$

is (coincidence apart) nonzero. While $u_1$ is uncorrelated with $x_1$, it is 
correlated with $y_2$. Consequently the systematic part of the structural 
demand equation, namely $\alpha_1 y_2 + \alpha_2 x_1$, is not the CEF (or BLP) for $y_1$ 
given $y_2$ and $x_1$. If $u \mid x$ and $x$ were multinormal, then we would be 
assured that all CEF's were linear. In that case $E(y_1 \mid y_2, x_1) = \alpha_1^* y_2 + \alpha_2^* x_1$, 
say, where $\alpha_1^*$ and $\alpha_2^*$ are deducible from the structural parameters 
and variances and covariances. That assurance is not available in general, 
but the negative conclusion remains: $E(y_1 \mid y_2, x_1) \neq \alpha_1 y_2 + \alpha_2 x_1$. 
Similarly in the supply equation (32.2), we have 

$$E(y_2 \mid y_1, x_2, x_3) = \alpha_3 y_1 + \alpha_4 x_2 + \alpha_5 x_3 + E(u_2 \mid y_1, x_2, x_3) \neq \alpha_3 y_1 + \alpha_4 x_2 + \alpha_5 x_3.$$

This analysis may be summarized by saying that in the SEM, the systematic 
parts of the structural equations are not regression functions. It 
follows that they ought not to be estimated by least squares. 
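
The covariance between price and the demand shock can be illustrated by simulation. The sketch below is only an illustration under assumed values: the coefficients, the unit shock variances ($\sigma_{11} = \sigma_{22} = 1$, $\sigma_{12} = 0$), the normal distributions, and the sample size are our choices, not taken from the text. It draws from the reduced-form solution for $y_2$ and compares the sample covariance of $y_2$ and $u_1$ with $(\alpha_3\sigma_{11} + \sigma_{12})/\Delta$.

```python
import random

random.seed(0)

# Illustrative (assumed) parameter values.
a1, a2, a3, a4, a5 = -0.5, 1.0, 0.5, 2.0, 3.0
delta = 1.0 - a1 * a3
n = 100_000

y2_draws, u1_draws = [], []
for _ in range(n):
    x1, x2, x3 = (random.gauss(0.0, 1.0) for _ in range(3))
    # Uncorrelated structural shocks with unit variances (assumed).
    u1, u2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    # Reduced-form solution for price, Eq. (32.4).
    y2 = (a2 * a3 * x1 + a4 * x2 + a5 * x3 + a3 * u1 + u2) / delta
    y2_draws.append(y2)
    u1_draws.append(u1)

def cov(a, b):
    """Sample covariance (divisor n) of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((s - ma) * (t - mb) for s, t in zip(a, b)) / len(a)

sample_cov = cov(y2_draws, u1_draws)
population_cov = (a3 * 1.0 + 0.0) / delta   # (alpha_3 sigma_11 + sigma_12) / Delta

print(round(sample_cov, 3), population_cov)
```

With these assumed values the population covariance is 0.4, so the demand shock is correlated with the right-hand-side variable $y_2$ even though it is uncorrelated with every $x$.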

In contrast, consider the reduced form of our supply-demand model: 

(32.3) Quantity. $y_1 = \pi_{11} x_1 + \pi_{21} x_2 + \pi_{31} x_3 + v_1$, 
(32.4) Price. $y_2 = \pi_{12} x_1 + \pi_{22} x_2 + \pi_{32} x_3 + v_2$. 

Because $E(v_1 \mid x_1, x_2, x_3) = E(v_2 \mid x_1, x_2, x_3) = 0$, we have 

$$E(y_1 \mid x_1, x_2, x_3) = \pi_{11} x_1 + \pi_{21} x_2 + \pi_{31} x_3,$$
$$E(y_2 \mid x_1, x_2, x_3) = \pi_{12} x_1 + \pi_{22} x_2 + \pi_{32} x_3.$$

The systematic part of each reduced-form equation is the CEF of a $y$ 
given the $x$'s. So the reduced-form equations are regression equations, 
and hence their parameters are presumably estimable by least squares. 


32.3. Sampling 


Let us now turn from the population to the sample. Suppose that we 
obtain a sample of $n$ observations from the multivariate distribution of 
$x$ and $y$ by stratified sampling: $n$ values of $x'$, namely $x_i'$ ($i = 1, \ldots, n$), 
are selected, forming the rows of the $n \times k$ observed matrix $X$, with 
rank$(X) = k$. For each observation, a random drawing is made from the 
relevant joint-conditional distribution $g_2(y \mid x)$, giving the $y_i'$ ($i = 1, \ldots, n$), 
which form the rows of the $n \times m$ observed matrix $Y$. Successive 
drawings are independent. This completes the specification of our SEM. 

For convenience we will confine our attention to the two-equation 
case, where 

$$Y = (y_1, y_2), \quad \Pi = (\pi_1, \pi_2), \quad \Omega^* = \begin{pmatrix} \omega_{11} & \omega_{12} \\ \omega_{21} & \omega_{22} \end{pmatrix}.$$

We have 

$$E(y_1) = X\pi_1, \quad E(y_2) = X\pi_2,$$
$$V(y_1) = \omega_{11}I, \quad V(y_2) = \omega_{22}I, \quad C(y_1, y_2) = \omega_{12}I.$$

Except for notation, this is just the two-equation SUR (regression sys- 
tems) model of Chapter 30. The conclusion is that in the SEM, a SUR 
model applies to the reduced form. If the data were obtained by random 
sampling from the joint distribution of (x', у’), then we would have a 
neoclassical version of the SUR model. 


32.4. Remarks 


The SEM specification turns out to be a roundabout, rather exotic, way 
of specifying a SUR model for the reduced form. We have already 
discussed methods for estimating the parameters of a SUR model, 
including LS, GLS, FGLS, and normal-ML. Why is a separate discussion 
needed for this special case of a SUR model? Indeed, why not just 
estimate the reduced form by LS? After all, the reduced form appears 
to have identical explanatory variables, a SUR situation in which LS 
coincides with GLS: see Section 30.3. 
What does justify special econometric consideration of the SEM? 

In simultaneous-equation models, the targets of research are the 
structural parameters (the $\alpha$'s in our supply-demand example, the elements 
of $\Gamma$ and $B$ in the general case), rather than the reduced-form 
parameters (the $\pi$'s, or $\Pi$). So the SEM is a situation in which the 
parameters of interest are not those of the CEF's among observable 
variables. As a consequence, rules are needed for converting estimates 
of the $\pi$'s into estimates of the $\alpha$'s, that is, for converting estimates of 
$\Pi$ into estimates of $\Gamma$ and $B$. 

Why not just get LS estimates of the $\pi$'s and convert them into 
estimates of $\alpha$'s in the obvious way, as was done for the Keynesian model 
in Section 31.4? 

The answer here is two-fold. First, an SEM may imply restrictions on 
the $\pi$'s, in which case the best estimates of the $\pi$'s are not obtained by 
equation-by-equation LS on the reduced form. Second, there may be 
no way to convert estimates of $\pi$'s into estimates of $\alpha$'s, because the 
parameters $\alpha$ may not be uniquely deducible from the parameters $\pi$. 
These two items are interrelated: identification deals with the issue of 
whether $\Pi$ uniquely determines $\Gamma$ and $B$; restrictions deal with the issue 
of whether the prior knowledge of certain elements of $\Gamma$ and $B$ implies 
restrictions on $\Pi$. 


33 Identification and Restrictions 


33.1. Introduction 


We now investigate whether the structural parameters are uniquely 
determined by the reduced-form parameters. If we know the elements 
of the $\Pi$ matrix, can we uniquely deduce the values of the elements of 
$\Gamma$ and $B$? At first glance, this task seems hopeless, because there are 
only $km$ elements in $\Pi$, while there are $m^2$ elements in $\Gamma$ and $km$ elements 
in $B$. It appears that the number of unknowns, $m(m + k)$, must exceed 
the number of equations, $mk$. But a structural model typically will 
include prior knowledge of certain elements of $\Gamma$ and $B$. Such knowledge 
reduces the number of unknowns, and hence opens up the possibility 
of a unique solution for the remaining structural parameters. 
The prior knowledge may even be rich enough to constrain the values 
of the reduced-form coefficients. 

The key to our investigation is the matrix equation that relates the 
reduced-form coefficients to the structural-form coefficients, namely 
$\Pi = B\Gamma^{-1}$, which we may rewrite as 

$$\Pi\Gamma = B.$$

We will treat $\Pi$ as known along with particular elements of $\Gamma$ and $B$. 
The question will be whether we can solve $\Pi\Gamma = B$ uniquely for the 
remaining unknown elements of $\Gamma$ and $B$. When a structural parameter 
is uniquely determined in that manner, then we say that the parameter 
is identified in terms of $\Pi$, or more simply, that it is identified. Although 
the focus is on identification, a few preliminary remarks about estimation 
are also included. 


33.2. Supply-Demand Models 


We explore identification via three variants of a supply-demand system. 
Each variant is a two-equation model in which the endogenous variables 
$y_1$ = quantity and $y_2$ = price are determined by the exogenous variables 
$x_1$ = income, $x_2$ = wage rate, and $x_3$ = interest rate, and the structural 
disturbances $u_1$ = demand shock, and $u_2$ = supply shock. 

Model A. First take the example of Section 32.1. The structural form 
is: 

(33.1A) Demand. $y_1 = \alpha_1 y_2 + \alpha_2 x_1 + u_1$, 
(33.2A) Supply. $y_2 = \alpha_3 y_1 + \alpha_4 x_2 + \alpha_5 x_3 + u_2$. 

Observe that in this economic model, the values 1 and 0 have been pre-assigned 
to certain elements of $\Gamma$ and $B$. 

The reduced form is: 

(33.3) Quantity. $y_1 = \pi_{11} x_1 + \pi_{21} x_2 + \pi_{31} x_3 + v_1$, 
(33.4) Price. $y_2 = \pi_{12} x_1 + \pi_{22} x_2 + \pi_{32} x_3 + v_2$. 

As previously shown, the reduced-form coefficients relate to the structural-form 
coefficients via: 

$$\Pi = \begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \\ \pi_{31} & \pi_{32} \end{pmatrix} = (1/\Delta) \begin{pmatrix} \alpha_2 & \alpha_2\alpha_3 \\ \alpha_1\alpha_4 & \alpha_4 \\ \alpha_1\alpha_5 & \alpha_5 \end{pmatrix}, \quad \text{with } \Delta = 1 - \alpha_1\alpha_3.$$

Reading this as a system of six equations in five unknowns, we see that 
if we were given the $\pi$'s, then we could solve uniquely for the $\alpha$'s: 

$$\alpha_3 = \pi_{12}/\pi_{11}, \quad \alpha_1 = \pi_{21}/\pi_{22} = \pi_{31}/\pi_{32},$$
$$\Delta = 1 - \alpha_1\alpha_3, \quad \alpha_2 = \Delta\pi_{11}, \quad \alpha_4 = \Delta\pi_{22}, \quad \alpha_5 = \Delta\pi_{32}.$$

We conclude that all the structural coefficients are identified in terms 
of the reduced-form coefficients. Furthermore, there is a restriction on 
the reduced-form coefficients, namely 

$$\pi_{21}/\pi_{22} = \pi_{31}/\pi_{32}.$$

It is not surprising to find one restriction on the $\pi$'s, because all six $\pi$'s 
are functions of only five $\alpha$'s. 


Actually, it is more convenient to analyze identification via $\Pi\Gamma = B$, 
which we write out here as: 

$$\begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \\ \pi_{31} & \pi_{32} \end{pmatrix} \begin{pmatrix} 1 & -\alpha_3 \\ -\alpha_1 & 1 \end{pmatrix} = \begin{pmatrix} \alpha_2 & 0 \\ 0 & \alpha_4 \\ 0 & \alpha_5 \end{pmatrix}.$$

Reading off, we see these six equations in five unknowns: 

(33.5A) 
$$\pi_{11} - \alpha_1\pi_{12} = \alpha_2, \qquad \pi_{12} - \alpha_3\pi_{11} = 0,$$
$$\pi_{21} - \alpha_1\pi_{22} = 0, \qquad \pi_{22} - \alpha_3\pi_{21} = \alpha_4,$$
$$\pi_{31} - \alpha_1\pi_{32} = 0, \qquad \pi_{32} - \alpha_3\pi_{31} = \alpha_5.$$

On the left, which refers to the demand equation, we see three equations 
in two unknowns; on the right, which refers to the supply equation, we 
see three equations in three unknowns. 

It is easy to solve the system on the left of (33.5A). First solve either 
of the equations that has 0 on its right-hand side. Because of the restriction 
on $\Pi$, they give the same answer, namely 

$$\alpha_1 = \pi_{21}/\pi_{22} = \pi_{31}/\pi_{32}.$$

Insert that value for $\alpha_1$ into the remaining equation to get 

$$\alpha_2 = \pi_{11} - \alpha_1\pi_{12}.$$

We conclude that the coefficients of the demand equation are identified 
in terms of $\Pi$. 

It is also easy to solve the system on the right of (33.5A). First solve 
the equation that has 0 on its right-hand side to get 

$$\alpha_3 = \pi_{12}/\pi_{11}.$$

Insert that value of $\alpha_3$ into the remaining equations to get 

$$\alpha_4 = \pi_{22} - \alpha_3\pi_{21}, \quad \alpha_5 = \pi_{32} - \alpha_3\pi_{31}.$$

We conclude that the coefficients of the supply equation are also identified 
in terms of $\Pi$. 

With respect to estimation, because there is a restriction on $\Pi$, 
equation-by-equation LS estimation of the reduced form will not be optimal: 
see Section 30.6. But if we estimate the reduced form subject to that 
restriction, then estimates of the $\pi$'s can be converted into estimates of 
the $\alpha$'s using the sample counterpart of the system (33.5A). 
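
The solution of system (33.5A) can be written out as a short routine. This is a sketch under assumptions: the function name `structure_from_pi_model_a` is ours, and the numerical $\Pi$ is generated from assumed values $\alpha = (-1/2, 1, 1/2, 2, 3)$ rather than taken from any data. The routine first checks the restriction $\pi_{21}/\pi_{22} = \pi_{31}/\pi_{32}$ and then solves the demand side and the supply side in turn.

```python
from fractions import Fraction as F

def structure_from_pi_model_a(Pi):
    """Solve system (33.5A) for (alpha_1, ..., alpha_5) given a 3 x 2 Pi."""
    (p11, p12), (p21, p22), (p31, p32) = Pi
    # Model A restricts Pi; cross-multiplied form of pi21/pi22 = pi31/pi32.
    if p21 * p32 != p31 * p22:
        raise ValueError("Pi violates the Model A restriction")
    a1 = p21 / p22            # demand side: either equation with 0 on the right
    a2 = p11 - a1 * p12       # demand side: remaining equation
    a3 = p12 / p11            # supply side: the equation with 0 on the right
    a4 = p22 - a3 * p21       # supply side: remaining equations
    a5 = p32 - a3 * p31
    return a1, a2, a3, a4, a5

# Pi implied by the assumed alphas (-1/2, 1, 1/2, 2, 3), for which Delta = 5/4.
Pi = [[F(4, 5), F(2, 5)],
      [F(-4, 5), F(8, 5)],
      [F(-6, 5), F(12, 5)]]

alphas = structure_from_pi_model_a(Pi)
print(alphas)
```

In a sample, the unrestricted LS estimate of $\Pi$ would not in general satisfy the restriction exactly, which is why the text calls for estimating the reduced form subject to it before converting.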

Model B. Modify the structural model by allowing the wage rate $x_2$ to 
enter the demand equation: 

(33.1B) Demand. $y_1 = \alpha_1 y_2 + \alpha_2 x_1 + \alpha_6 x_2 + u_1$, 
(33.2B) Supply. $y_2 = \alpha_3 y_1 + \alpha_4 x_2 + \alpha_5 x_3 + u_2$. 

The reduced-form equations are again (33.3)-(33.4) but now, in the 
$\Pi\Gamma = B$ format, the relation between reduced-form and structural 
coefficients is: 

$$\begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \\ \pi_{31} & \pi_{32} \end{pmatrix} \begin{pmatrix} 1 & -\alpha_3 \\ -\alpha_1 & 1 \end{pmatrix} = \begin{pmatrix} \alpha_2 & 0 \\ \alpha_6 & \alpha_4 \\ 0 & \alpha_5 \end{pmatrix}.$$

Reading off, we see these six equations in six unknowns: 

(33.5B) 
$$\pi_{11} - \alpha_1\pi_{12} = \alpha_2, \qquad \pi_{12} - \alpha_3\pi_{11} = 0,$$
$$\pi_{21} - \alpha_1\pi_{22} = \alpha_6, \qquad \pi_{22} - \alpha_3\pi_{21} = \alpha_4,$$
$$\pi_{31} - \alpha_1\pi_{32} = 0, \qquad \pi_{32} - \alpha_3\pi_{31} = \alpha_5.$$

On each side we see three equations in three unknowns. On the left of 
(33.5B), solve the last equation for $\alpha_1 = \pi_{31}/\pi_{32}$; insert that value into 
the first and second equations to get $\alpha_2$ and $\alpha_6$. We conclude that the 
coefficients of the demand equation are identified in terms of $\Pi$. On 
the right of (33.5B), solve the first equation for $\alpha_3 = \pi_{12}/\pi_{11}$; insert that 
value into the other two equations to get $\alpha_4$ and $\alpha_5$. We conclude that 
the coefficients of the supply equation are also identified in terms of $\Pi$. 
There are no restrictions on the $\pi$'s, which is not surprising because the 
six $\pi$'s are functions of six $\alpha$'s. 

With respect to estimation, because there are no restrictions on $\Pi$, 
the reduced form is a SUR model with identical explanatory variables, 
so equation-by-equation LS will coincide with GLS and hence be optimal. 
The LS estimates of the $\pi$'s can be converted into estimates of the $\alpha$'s 
by using the sample counterpart of the system (33.5B). 

Model C. Modify the original structural model by allowing income $x_1$ 
to enter the supply equation: 

(33.1C) Demand. $y_1 = \alpha_1 y_2 + \alpha_2 x_1 + u_1$, 
(33.2C) Supply. $y_2 = \alpha_3 y_1 + \alpha_7 x_1 + \alpha_4 x_2 + \alpha_5 x_3 + u_2$. 

The reduced-form equations are again (33.3)-(33.4) but now, in the 
$\Pi\Gamma = B$ format, the relation between reduced-form and structural 
coefficients is: 

$$\begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \\ \pi_{31} & \pi_{32} \end{pmatrix} \begin{pmatrix} 1 & -\alpha_3 \\ -\alpha_1 & 1 \end{pmatrix} = \begin{pmatrix} \alpha_2 & \alpha_7 \\ 0 & \alpha_4 \\ 0 & \alpha_5 \end{pmatrix}.$$

Reading off, we see these six equations in six unknowns: 

(33.5C) 
$$\pi_{11} - \alpha_1\pi_{12} = \alpha_2, \qquad \pi_{12} - \alpha_3\pi_{11} = \alpha_7,$$
$$\pi_{21} - \alpha_1\pi_{22} = 0, \qquad \pi_{22} - \alpha_3\pi_{21} = \alpha_4,$$
$$\pi_{31} - \alpha_1\pi_{32} = 0, \qquad \pi_{32} - \alpha_3\pi_{31} = \alpha_5.$$

It is clear how to solve the system on the left of (33.5C), that is, to 
determine the parameters of the demand equation. First solve either of 
the equations that has a 0 on its right-hand side for $\alpha_1 = \pi_{31}/\pi_{32} = 
\pi_{21}/\pi_{22}$, and then get $\alpha_2$ from the remaining equation. So the coefficients 
of the demand equation are identified in terms of $\Pi$. And there is a 
restriction on the $\pi$'s, namely $\pi_{31}/\pi_{32} = \pi_{21}/\pi_{22}$, which is not surprising 
because on the left of (33.5C) there are three equations in two 
unknowns. 

However, the system on the right of (33.5C) consists of three equations 
in four unknowns. We can assign any value to $\alpha_3$ and then solve for $\alpha_4$, 
$\alpha_5$, $\alpha_7$. A different arbitrary value for $\alpha_3$ would generate different values 
for $\alpha_4$, $\alpha_5$, $\alpha_7$. The solution is not unique. Evidently, there are an infinity 
of alternative sets of values for the supply-equation coefficients that, in 
conjunction with the appropriate set of values for the demand-equation 
coefficients, produce the same set of values for the $\pi$'s. Consequently, 
$\Pi$ does not contain enough information to uniquely deduce the $\Gamma$ and 
$B$ that produced it. We conclude that the coefficients of the supply 
equation are not identified in terms of $\Pi$. 
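
The lack of uniqueness in Model C can be seen in a small numerical sketch. Everything specific here is assumed for illustration: the "true" coefficient values, the trial values for $\alpha_3$, and the function names. Starting from one structure, we compute $\Pi$, recover the demand coefficients from the left of (33.5C), and then show that several different choices of $\alpha_3$, each completed by the right of (33.5C), reproduce exactly the same $\Pi$.

```python
from fractions import Fraction as F

def reduced_form_model_c(a1, a2, a3, a4, a5, a7):
    """Pi = B Gamma^{-1} for Model C, written out element by element."""
    d = 1 - a1 * a3
    return [[(a2 + a1 * a7) / d, (a2 * a3 + a7) / d],
            [a1 * a4 / d, a4 / d],
            [a1 * a5 / d, a5 / d]]

def demand_from_pi(Pi):
    """Left side of (33.5C): the demand coefficients are identified."""
    (p11, p12), (p21, p22), _ = Pi
    a1 = p21 / p22
    return a1, p11 - a1 * p12

def supply_from_pi(Pi, a3):
    """Right side of (33.5C): supply coefficients implied by a chosen alpha_3."""
    (p11, p12), (p21, p22), (p31, p32) = Pi
    return p12 - a3 * p11, p22 - a3 * p21, p32 - a3 * p31   # a7, a4, a5

# Assumed "true" structure, used only to generate a Pi.
true = dict(a1=F(-1, 2), a2=F(1), a3=F(1, 2), a4=F(2), a5=F(3), a7=F(1))
Pi = reduced_form_model_c(**true)

a1, a2 = demand_from_pi(Pi)
candidates = []
for a3 in (F(1, 2), F(0), F(-3)):        # arbitrary trial values for alpha_3
    a7, a4, a5 = supply_from_pi(Pi, a3)
    candidates.append((a3, a4, a5, a7))
    # Each candidate supply equation reproduces the same reduced form.
    assert reduced_form_model_c(a1, a2, a3, a4, a5, a7) == Pi

for c in candidates:
    print(c)
```

The trial values must keep $\Gamma$ nonsingular, that is, $1 - \alpha_1\alpha_3 \neq 0$; with the assumed $\alpha_1 = -1/2$ this excludes only $\alpha_3 = -2$.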

With respect to estimation, because there is a restriction on $\Pi$, LS 
estimation of the reduced form will not be optimal. If the reduced form 
is estimated subject to that restriction, then estimates of the demand 
equation can be derived. But in Model C, there is no way to estimate 
the supply equation: to seek estimates of its coefficients is not a meaningful 
task. 

To recapitulate: we have answered the question "Are the $\alpha$'s uniquely 
determined by the $\pi$'s?" for three variants of the supply-demand model. 
Because certain elements of $\Gamma$ and $B$ were known a priori, the answer 
was sometimes yes. Indeed sometimes the knowledge was sufficient 
to restrict the set of admissible $\pi$'s. We have seen that not only 
the number of pieces of prior information but also their location is 
crucial to the answers. In our examples, the prior information consisted 
only of exclusions (zero coefficients) and normalizations (a 1 in each column 
of $\Gamma$). In other simultaneous-equation models there may be additional 
pieces of prior information; for example, two structural coefficients 
may be known to be equal. Such information also serves to aid identification 
and may even constrain the reduced-form coefficient matrix $\Pi$. 


33.3. Uncorrelated Disturbances 


We have focused on getting $\Gamma$ and $B$ from $\Pi\Gamma = B$, but there is another 
relation between structural and reduced-form parameters that may 
assist identification, namely $\Omega^* = (\Gamma^{-1})'\Sigma^*\Gamma^{-1}$, obtained as Eq. (B5) in 
Chapter 32. We may rewrite this as 

(33.6) $\Sigma^* = \Gamma'\Omega^*\Gamma$. 

Like the coefficient matrix $\Pi$, the disturbance variance matrix $\Omega^*$ is 
estimable from LS regression on the reduced form. Suppose that both 
$\Pi$ and $\Omega^*$ are known. Can we exploit Eq. (33.6) to help deduce $\Gamma$? In 
general, the answer is no. With $\Sigma^*$ unknown, Eq. (33.6) merely suffices 
to deduce $\Sigma^*$ once $\Omega^*$ and $\Gamma$ are known. However, there may be prior 
information on $\Sigma^*$ that reduces the number of unknowns in Eq. (33.6) 
and thus frees it to help in identifying $\Gamma$. For example, suppose that 
the structural disturbances are known to be uncorrelated with one 
another. Then $\Sigma^*$ is diagonal, so there are only $m$ unknown elements 
of $\Sigma^*$, while Eq. (33.6) has $m(m + 1)/2$ distinct equations. 

To illustrate: for our supply-demand Model C, Eq. (33.6) is 

$$\begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} = \begin{pmatrix} 1 & -\alpha_1 \\ -\alpha_3 & 1 \end{pmatrix} \begin{pmatrix} \omega_{11} & \omega_{12} \\ \omega_{21} & \omega_{22} \end{pmatrix} \begin{pmatrix} 1 & -\alpha_3 \\ -\alpha_1 & 1 \end{pmatrix}.$$

The off-diagonal element is 

$$\sigma_{12} = -\alpha_3\omega_{11} + \omega_{12} + \alpha_1\alpha_3\omega_{12} - \alpha_1\omega_{22}.$$

Suppose that the structural disturbances are known to be uncorrelated, 
so $\sigma_{12} = 0$. Then 

(33.7) $\alpha_3 = (\omega_{12} - \alpha_1\omega_{22})/(\omega_{11} - \alpha_1\omega_{12})$, 

so $\alpha_3$ is uniquely determined by the $\omega$'s and $\alpha_1$. Referring back to the 
analysis of Model C in Section 33.2, we see that this will suffice to 
complete identification of the supply equation. So in this case, all the 
structural coefficients will be identified in terms of $\Pi$ and $\Omega^*$. 

With respect to estimation, because the $\omega$'s are estimable from the 
residuals of LS regression on the reduced form, the sample counterpart 
of Eq. (33.7) will in this situation be usable along with the sample 
counterpart of system (33.5C). 
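
Equation (33.7) can be verified with a small exact calculation. The sketch below assumes illustrative values for $\alpha_1$, $\alpha_3$, and a diagonal $\Sigma^*$; the helper names are ours. It builds $\Omega^* = (\Gamma^{-1})'\Sigma^*\Gamma^{-1}$ and then recovers $\alpha_3$ from the $\omega$'s and $\alpha_1$ alone.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

# Assumed values; Sigma* is diagonal (uncorrelated structural shocks).
a1, a3 = F(-1, 2), F(1, 2)
Sigma = [[F(2), F(0)], [F(0), F(3)]]

det = 1 - a1 * a3
Gamma_inv = [[1 / det, a3 / det], [a1 / det, 1 / det]]

# Eq. (B5): Omega* = (Gamma^{-1})' Sigma* Gamma^{-1}.
Omega = matmul(matmul(transpose(Gamma_inv), Sigma), Gamma_inv)
w11, w12, w22 = Omega[0][0], Omega[0][1], Omega[1][1]

# Eq. (33.7): alpha_3 from the omegas and alpha_1.
a3_recovered = (w12 - a1 * w22) / (w11 - a1 * w12)

print(a3_recovered == a3)
```

The calculation uses $\alpha_1$ as an input because, in Model C, $\alpha_1$ is already identified from $\Pi$.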

A leading special case, known as the fully recursive model, arises when 
$\Sigma^*$ is diagonal and $\Gamma$ is triangular. Here, all off-diagonal elements in $\Sigma^*$ 
are zero, and in $\Gamma$ all elements below the diagonal are zero. Then not 
only are all the structural parameters identified, but in fact they are 
estimable by LS regression on the structural equations. If $\Gamma$ is upper-triangular, 
then $\Gamma^{-1}$ will also be upper-triangular, so $(\Gamma^{-1})'$ will be lower-triangular. 
In conjunction with the diagonality of $\Sigma^*$, this will imply that 

$$C(y, u) = (\Gamma^{-1})'\Sigma^*$$

is lower-triangular. Any endogenous variable on the right-hand side of 
a structural equation will be uncorrelated with the disturbance in that 
equation. If so, each structural equation is a CEF (or at least a BLP), 
hence identified, and indeed estimable by LS. 
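
A two-equation instance makes the triangularity argument concrete. In this sketch the upper-triangular $\Gamma$ and the diagonal $\Sigma^*$ are assumed values, and the helper names are ours. The second structural equation has $y_1$ on its right-hand side, and the computed $C(y, u)$ shows that $y_1$ is uncorrelated with $u_2$.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

g12 = F(-1, 2)                               # assumed above-diagonal element
Gamma = [[F(1), g12], [F(0), F(1)]]          # upper-triangular
Gamma_inv = [[F(1), -g12], [F(0), F(1)]]     # its inverse, also upper-triangular
Sigma = [[F(2), F(0)], [F(0), F(3)]]         # diagonal Sigma*

# C(y, u) = (Gamma^{-1})' Sigma*; rows refer to y's, columns to u's.
C_yu = matmul(transpose(Gamma_inv), Sigma)

print(C_yu)   # element [0][1] is C(y1, u2)
```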


33.4. Other Sources of Identification 


We have seen that the reduced-form coefficient matrix $\Pi$ and disturbance 
variance matrix $\Omega^*$ may both be used to identify structural 
parameters in the SEM. Can anything else be used? The answer is 
effectively no. After all, the most one can hope to learn from stratified 
sampling is $g_2(y \mid x)$, the joint-conditional pdf (or pmf) of the endogenous 
variables given the exogenous variables. If we learn that distribution, 
then we will know $E(y' \mid x) = x'\Pi$ and $V(y \mid x) = \Omega^*$. If $y \mid x$ is 
multinormal, then there is nothing more to learn: knowledge of both 
$\Pi$ and $\Omega^*$ is equivalent to knowledge of $g_2(y \mid x)$. So if a structural 
parameter is not identified in terms of $\Pi$ and $\Omega^*$, then it is not identified 
in terms of $g_2(y \mid x)$; that is, it is not identified. To be sure, if $g_2(y \mid x)$ 
were known to be nonnormal, then there might be more information 
available, but that situation is rare indeed, and we ignore it here. In 
random sampling, we can also learn $f(x)$, the joint-marginal pdf (or 
pmf) of the exogenous variables, but the structural parameters do not 
enter that function. 

In the next chapter, we will proceed on the presumption that the only 
prior information consists of exclusions and normalizations on $\Gamma$ and 
$B$. Then $\Pi$ is the sole source of information available for identifying 
the unknown structural coefficients. The remaining task will be to obtain 
estimates of $\Pi$ that are convertible into estimates of the unknown, but 
identified, elements of $\Gamma$ and $B$. 


Exercises 


33.1 Consider the simultaneous-equation model 

$$y_1 = \alpha_1 y_2 + \alpha_2 x_1 + u_1,$$
$$y_2 = \alpha_3 y_1 + \alpha_4 x_2 + u_2,$$

where the exogenous variables $x_1$ and $x_2$ are independent of the disturbances 
$u_1$ and $u_2$. The reduced form of the model is 

$$y_1 = \pi_1 x_1 + \pi_2 x_2 + v_1,$$
$$y_2 = \pi_3 x_1 + \pi_4 x_2 + v_2.$$

(a) You are told that $\pi_1 = 1$, $\pi_2 = 4$, $\pi_3 = -2$, $\pi_4 = 2$. Determine 
the values of $\alpha_1$, $\alpha_2$, $\alpha_3$, $\alpha_4$. 

(b) You are also told that $x_1$, $x_2$, $u_1$, $u_2$ are independent $N(0, 1)$ 
variables. Predict the value of $y_1$ that will occur if $y_2 = x_1 = 1$. 


33.2 A simple theoretical model for the labor market consists of the 
supply function $H = S(W, N)$ and the demand function $W = D(H, X)$, 
where the endogenous variables are $H$ = hours worked and $W$ = wage 
rate, while the exogenous variables are $N$ = family size and $X$ = worker 
characteristics. The economic presumptions are that $\partial S/\partial W > 0$ and 
$\partial D/\partial H < 0$. A linear version is 

$$y_1 = \alpha_1 y_2 + \alpha_2 x_1 + \alpha_3 x_2 + u_1,$$
$$y_2 = \alpha_4 y_1 + \alpha_5 x_1 + \alpha_6 x_3 + \alpha_7 x_4 + \alpha_8 x_5 + u_2,$$

where $y_1$ = months worked, $y_2$ = wage rate, $x_1 = 1$, $x_2$ = family size, 
$x_3$ = education, $x_4$ = age, $x_5$ = race (= 1 if black, = 0 if white), $u_1$ = 
supply shock, $u_2$ = demand shock. Suppose that the SEM applies. 
Write out the system $\Pi\Gamma = B$, and analyze identification and restrictions. 


34 Estimation in the Simultaneous-Equation 
Model 


34.1. Introduction 


We proceed to methods for estimating the structural parameters in the 
SEM. We continue to confine attention to the two-equation case, with 
some specialization to our supply-demand models. The only types of 
prior information that we allow for are exclusions and normalizations. 

For the population, the model has 

$$E(y' \mid x) = x'\Pi, \quad V(y \mid x) = \Omega^*,$$

where $\Pi = (\pi_1, \pi_2)$ is $k \times 2$, and $\Omega^* = \{\omega_{jh}\}$ is $2 \times 2$ and positive 
definite. Reading off, we have 

$$E(y_1 \mid x) = x'\pi_1, \quad E(y_2 \mid x) = x'\pi_2,$$
$$V(y_1 \mid x) = \omega_{11}, \quad V(y_2 \mid x) = \omega_{22}, \quad C(y_1, y_2 \mid x) = \omega_{12}.$$

We sample by the stratified scheme, so that the $n \times k$ matrix $X$ is 
nonstochastic and has rank $k$, while the $n \times 2$ matrix $Y = (y_1, y_2)$ is 
random. We have 

$$E(y_1) = X\pi_1, \quad E(y_2) = X\pi_2,$$
$$V(y_1) = \omega_{11}I, \quad V(y_2) = \omega_{22}I, \quad C(y_1, y_2) = \omega_{12}I.$$

Except for notation, this is precisely the SUR (regression systems) model 
of Chapter 30. To stack the two equations, let 

$$y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \quad X^\circ = \begin{pmatrix} X & O \\ O & X \end{pmatrix}, \quad \pi = \begin{pmatrix} \pi_1 \\ \pi_2 \end{pmatrix}, \quad \Omega = \begin{pmatrix} \omega_{11}I & \omega_{12}I \\ \omega_{21}I & \omega_{22}I \end{pmatrix}.$$

Then $E(y) = X^\circ\pi$, $V(y) = \Omega = \Omega^* \otimes I$ is positive definite, and $X^\circ$ is 
nonstochastic with full column rank. (Caution: $y$ now denotes the $2n \times 1$ 
vector of observations, rather than the original $2 \times 1$ random vector.) 


34.2. Indirect Feasible Generalized Least Squares 


Because the SUR model applies, GLS is the natural estimation procedure. 
First, suppose that $\Omega^*$, and hence $\Omega$, is known. As the estimator 
of $\pi$, we would choose the vector $c$ that minimizes the GLS criterion 
$\phi(c) = v'\Omega^{-1}v$, where $v = y - X^\circ c$. 

If there are no restrictions on $\pi$, then we have a SUR model with 
identical explanatory variables, and the solution is obvious: GLS reduces 
to equation-by-equation LS, as shown in Section 30.3. In the present 
notation, this means that the GLS estimator of $\pi$ is 

$$p = \begin{pmatrix} p_1 \\ p_2 \end{pmatrix} = \begin{pmatrix} Ay_1 \\ Ay_2 \end{pmatrix},$$

with $A = Q^{-1}X'$, $Q = X'X$. Reassembling, the GLS estimator of $\Pi$ is 

$$P = (p_1, p_2) = (Ay_1, Ay_2) = A(y_1, y_2) = AY.$$

The implied estimates of the structural parameters follow by solving 
the sample counterpart of $\Pi\Gamma = B$, namely 

(34.1) $P\hat{\Gamma} = \hat{B}$, 

for the unknown elements of $\hat{\Gamma}$ and $\hat{B}$. That is to say, do in the sample 
what we did in the population for Model B in Section 33.2. 

If there are restrictions on $\pi$, then those should be imposed in the 
minimization. The most convenient way to impose them is to solve them 
out, which amounts to expressing the $\pi$'s in terms of $\alpha$'s and choosing 
estimates of the (unrestricted) $\alpha$'s. Consider, for example, Model A of 
Chapter 33. Here 

$$\Pi = \begin{pmatrix} \pi_{11} & \pi_{12} \\ \pi_{21} & \pi_{22} \\ \pi_{31} & \pi_{32} \end{pmatrix} = (1/\Delta) \begin{pmatrix} \alpha_2 & \alpha_2\alpha_3 \\ \alpha_1\alpha_4 & \alpha_4 \\ \alpha_1\alpha_5 & \alpha_5 \end{pmatrix}, \quad \text{with } \Delta = 1 - \alpha_1\alpha_3.$$

Write $\pi = (\pi_{11}, \pi_{21}, \pi_{31}, \pi_{12}, \pi_{22}, \pi_{32})'$ and $\alpha = (\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5)'$. 
Let $\pi = g(\alpha)$ be the mapping from the true structural coefficients to 
the true reduced-form coefficients. Correspondingly, write $c = (c_1, \ldots, c_6)'$ 
and $a = (a_1, \ldots, a_5)'$. Then $c = g(a)$ is the mapping from the choice 
vector (estimator) for the structural coefficients to the choice vector 
(estimator) for the reduced-form coefficients. Referring to the display 
above, this mapping is 

$$c_1 = a_2/(1 - a_1a_3), \qquad c_4 = a_2a_3/(1 - a_1a_3),$$
$$c_2 = a_1a_4/(1 - a_1a_3), \qquad c_5 = a_4/(1 - a_1a_3),$$
$$c_3 = a_1a_5/(1 - a_1a_3), \qquad c_6 = a_5/(1 - a_1a_3).$$

We propose to estimate $\alpha$ by the vector $a$ that minimizes the GLS 
criterion 

$$\psi(a) = \phi[g(a)] = v'\Omega^{-1}v,$$

with 

$$v = y - X^\circ g(a).$$

The associated estimate of $\pi$ will be $c = g(a)$. 

Computationally, it may be convenient to transform this into an LS 
problem. To do so, let $H^*$ be the $2 \times 2$ matrix such that $H^{*\prime}H^* = \Omega^{*-1}$, 
and let $H = H^* \otimes I_n$. Then, as is easily verified, $H'H = \Omega^{-1}$. 
With such an $H$ matrix in hand, we can rewrite the GLS criterion as 

$$\psi(a) = v^{*\prime}v^*,$$

with 

$$v^* = Hv = Hy - HX^\circ g(a) = y^* - X^{\circ *}g(a),$$

say. We take as our estimates of the $\alpha$'s the values of the $a$'s that minimize 
$\psi(a)$. In view of the form of $g(a)$, this is a nonlinear least squares 
problem, so the algorithm discussed in Sections 29.2 and 29.3 may be 
used. 

Next suppose that, as in practice, $\Omega^*$, and hence $\Omega$, is unknown. The 
natural procedure is feasible generalized least squares. The FGLS algorithm 
will parallel the GLS algorithm, except that an estimator $\hat{\Omega}$ is 
used in place of $\Omega$. The estimator comes from the residuals of the LS 
reduced-form regressions. More explicitly, let 

$$\hat{v}_j = y_j - Xp_j = My_j \quad (j = 1, 2),$$

where $M = I - XA$, and let 

$$\hat{V} = Y - XP = MY = (\hat{v}_1, \hat{v}_2).$$

Then 

$$\hat{\Omega}^* = (1/n)\hat{V}'\hat{V}$$

is the estimator for $\Omega^*$, and $\hat{\Omega} = \hat{\Omega}^* \otimes I$ is the estimator for $\Omega$. 

The rest of the computational algorithm can then track that for GLS. 
If there are no restrictions on $\pi$, then taking $c = p$ (that is, $\hat{\Pi} = P$) will 
solve the minimization problem. (In a SUR model with identical explanatory 
variables, FGLS, like GLS, coincides with LS.) If there are restrictions 
on $\pi$, an NLLS algorithm is usable. We refer to the resulting 
estimates of the $\alpha$'s as indirect feasible generalized least squares, or indirect-FGLS, 
estimates, because we are in effect estimating the $\pi$'s by FGLS and 
converting them into estimates of the $\alpha$'s. (If there are no restrictions 
on $\pi$, then indirect-FGLS coincides with indirect least squares.) In the 
literature, the indirect-FGLS procedure is sometimes referred to as a 
minimum-distance procedure. 

With respect to sampling properties: because the FGLS estimates of 
$\Pi$ are consistent and BAN, the indirect-FGLS estimates of $B$ and $\Gamma$ are 
also consistent and BAN. So indirect-FGLS is one preferred way to 
estimate structural parameters in the SEM. 

Several remarks about the indirect-FGLS procedure: 

* In our algorithm, we use the relation between $\pi$'s and $\alpha$'s to solve 
out the restrictions, reducing the problem to unconstrained, but nonlinear, 
minimization. It is easy to see that the resulting estimates satisfy 
the sample counterpart of $\Pi\Gamma = B$. 

* If one or more of the structural equations is not identified in terms 
of $\Pi$, then the indirect-FGLS procedure will break down, as it should. 

* If in the population, the conditional distribution of the $m \times 1$ 
random vector $y$, given the $k \times 1$ vector $x$, is multinormal, then maximum-likelihood 
estimation is available. For historical reasons, this 
method is known as full-information maximum likelihood, or FIML. From 
the discussion in Section 30.7, we can verify several facts. If $\Omega^*$ is known, 
then FIML coincides with GLS. If $\Omega^*$ is unknown, then FIML minimizes 
$|V'V|$, which differs from the FGLS criterion tr$(\hat{\Omega}^{*-1}V'V)$, but the 
estimators have the same asymptotic distribution. Iterating indirect-FGLS 
until convergence is an algorithm for solving the FOC's of FIML 
estimation. 

The indirect-FGLS and FIML procedures are computationally complex 
when there are restrictions, because the restrictions are characteristically 
nonlinear in the $\pi$'s. As a consequence, structural estimation 
procedures have been developed that use the unrestricted reduced-form 
estimate, $P = (p_1, p_2)$, in a quite different way. Nowadays the 
complexity is less of a concern, but the simpler methods are widely used, 
and therefore we will sketch two of them. 


34.3. Two-Stage Least Squares 


The two-stage least squares, or 2SLS, method is the most popular procedure 
for estimating a simultaneous-equation model. Its mechanics can be 
described very simply. In the first stage, each endogenous variable is 
regressed on all the exogenous variables, and fitted values are obtained. 
In the second stage, each structural equation is taken in turn,
right-hand-side endogenous variables are replaced by their fitted values, and
LS is run. The 2SLS algorithm does not involve nonlinear optimization, 
which accounts for its popularity. 

We describe the procedure explicitly in terms of the supply-demand
models of Chapter 33. The data consist of the $n \times 2$ matrix $Y = (y_1, y_2)$
and the $n \times k$ matrix $X = (x_1, \ldots, x_k)$. The familiar regression matrices
are

$$Q = X'X, \qquad A = Q^{-1}X', \qquad N = XA, \qquad M = I - N.$$

We have

$$AY = (Ay_1, Ay_2) = (p_1, p_2) = P, \qquad AX = I,$$

$$NY = (Ny_1, Ny_2) = (\hat{y}_1, \hat{y}_2) = \hat{Y}, \qquad NX = X.$$


Focus on one of the structural equations, say the demand equation
in Model B. In population terms this is

$$y_1 = \alpha_1 y_2 + \alpha_2 x_1 + \alpha_6 x_2 + u_1.$$

For the sample of size $n$ it is

$$y_1 = y_2\alpha_1 + x_1\alpha_2 + x_2\alpha_6 + u_1
     = (y_2, x_1, x_2)\begin{pmatrix}\alpha_1 \\ \alpha_2 \\ \alpha_6\end{pmatrix} + u_1
     = Z_1\boldsymbol{\alpha}_1 + u_1,$$

say, where $Z_1 = (y_2, x_1, x_2)$ is $n \times 3$, and $\boldsymbol{\alpha}_1 = (\alpha_1, \alpha_2, \alpha_6)'$ is $3 \times 1$.
Regressing $y_1$ on $Z_1$ would give the normal equations

$$Z_1'Z_1\mathbf{a}_1 = Z_1'y_1,$$

the solution to which is the LS coefficient vector $\mathbf{a}_1 = (Z_1'Z_1)^{-1}Z_1'y_1$. As
we know, this is not a sensible estimator of $\boldsymbol{\alpha}_1$.
Instead, replace $Z_1$ by

$$\hat{Z}_1 = NZ_1 = N(y_2, x_1, x_2) = (Ny_2, Nx_1, Nx_2) = (\hat{y}_2, x_1, x_2),$$

and regress $y_1$ on $\hat{Z}_1$. This gives the normal equations

$$\hat{Z}_1'\hat{Z}_1\mathbf{a}_1^* = \hat{Z}_1'y_1,$$

the solution to which is the 2SLS estimator

$$\mathbf{a}_1^* = (\hat{Z}_1'\hat{Z}_1)^{-1}\hat{Z}_1'y_1.$$

The (asymptotic) variance matrix of $\mathbf{a}_1^*$ is estimated as

$$\hat{V}(\mathbf{a}_1^*) = \hat{\sigma}_{11}(\hat{Z}_1'\hat{Z}_1)^{-1},$$

where

$$\hat{\sigma}_{11} = e_1^{*\prime}e_1^*/(n - k^*),$$

with $k^*$ being the number of right-hand-side variables (columns of $Z_1$),
and

$$e_1^* = y_1 - Z_1\mathbf{a}_1^*.$$

Observe that the original values $Z_1$ are used in calculating residuals,
even though the fitted values $\hat{Z}_1$ were used in calculating coefficients.
(In calculating $\hat{\sigma}_{11}$, division by $n$ rather than $n - k^*$ is also acceptable
in view of the fact that asymptotic theory is being relied on.)
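
The two stages and the variance calculation can be traced in a short program. What follows is only a sketch under stated assumptions: the data are simulated from a Model B-type system whose coefficient values are made up for illustration, the helper names (cross, solve, two_sls) are hypothetical, and plain Python lists stand in for a matrix language. It shows the first-stage fit, the second-stage coefficients, and the residuals computed from the original right-hand-side variables.

```python
import random

def cross(A, B):
    """Return A'B for matrices stored as lists of rows."""
    return [[sum(a[i] * b[j] for a, b in zip(A, B))
             for j in range(len(B[0]))] for i in range(len(A[0]))]

def solve(M, R):
    """Solve M C = R by Gauss-Jordan elimination with partial pivoting."""
    k = len(M)
    aug = [list(M[i]) + list(R[i]) for i in range(k)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]

def two_sls(y, Z, X):
    """Return (coefficients, estimated variance matrix) for one structural equation."""
    n, k_star = len(Z), len(Z[0])
    AZ = solve(cross(X, X), cross(X, Z))             # first stage: (X'X)^{-1} X'Z
    Z_hat = [[sum(x * AZ[i][j] for i, x in enumerate(row))
              for j in range(k_star)] for row in X]  # fitted values N Z
    ZhZh = cross(Z_hat, Z_hat)
    a = [c[0] for c in solve(ZhZh, cross(Z_hat, y))]
    # Residuals use the ORIGINAL Z, not the fitted values.
    e = [yi[0] - sum(z * c for z, c in zip(zi, a)) for yi, zi in zip(y, Z)]
    sigma11 = sum(v * v for v in e) / (n - k_star)
    eye = [[float(i == j) for j in range(k_star)] for i in range(k_star)]
    V = [[sigma11 * v for v in row] for row in solve(ZhZh, eye)]
    return a, V

# Simulated data; every parameter value below is an assumption for illustration.
random.seed(12345)
a1, a2, a6 = -0.5, 1.0, 0.5      # demand: y1 = a1*y2 + a2*x1 + a6*x2 + u1
a3, a4, a5 = 0.5, 1.0, -1.0      # supply: y2 = a3*y1 + a4*x2 + a5*x3 + u2
X, Z1, y1 = [], [], []
for _ in range(4000):
    x1, x2, x3 = (random.gauss(0, 1) for _ in range(3))
    common = random.gauss(0, 1)                  # makes u1 and u2 correlated
    u1, u2 = common + random.gauss(0, 1), common + random.gauss(0, 1)
    rhs1 = a2 * x1 + a6 * x2 + u1
    rhs2 = a4 * x2 + a5 * x3 + u2
    v1 = (rhs1 + a1 * rhs2) / (1 - a1 * a3)      # reduced form for y1
    v2 = (rhs2 + a3 * rhs1) / (1 - a1 * a3)      # reduced form for y2
    X.append([x1, x2, x3])
    Z1.append([v2, x1, x2])
    y1.append([v1])

a_2sls, V = two_sls(y1, Z1, X)
a_ls = [c[0] for c in solve(cross(Z1, Z1), cross(Z1, y1))]
std_err = [V[i][i] ** 0.5 for i in range(3)]
print("2SLS:", a_2sls, "s.e.:", std_err)
print("LS:  ", a_ls)
```

In a run of this kind the 2SLS coefficients should lie close to the values used to generate the data, while the LS coefficient on $y_2$ is not expected to, because $y_2$ is correlated with $u_1$.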

There are at least two heuristic rationales for the 2SLS procedure,
which we exposit in the context of the demand equation of Model B.
First, observe that

$$E(y_1|x) = \alpha_1E(y_2|x) + \alpha_2x_1 + \alpha_6x_2
           = \alpha_1y_2^* + \alpha_2x_1 + \alpha_6x_2,$$

where

$$y_2^* = x'\pi_2 = E(y_2|x).$$

So a sample LS regression of $y_1$ on $Z_1^* = (y_2^*, x_1, x_2)$ would give unbiased
estimates of $\boldsymbol{\alpha}_1$. That regression cannot be run because $y_2^*$ is unobserved.
Still, $p_2$ unbiasedly and consistently estimates $\pi_2$, whence the fitted-value
vector $\hat{y}_2$ unbiasedly and consistently estimates the conditional
expectation vector $y_2^*$. Making the natural replacement, $\hat{y}_2$ for $y_2^*$,
suffices to produce consistent estimates of the $\alpha$'s. The second rationale is
simpler. The 2SLS normal equations are equivalent to a set of orthogonality
conditions:

$$\hat{Z}_1'\hat{u}_1 = 0,$$

where $\hat{u}_1 = y_1 - Z_1\mathbf{a}_1^*$. To show equivalence, use the algebraic fact:

$$\hat{Z}_1'Z_1 = Z_1'N'Z_1 = Z_1'N'NZ_1 = \hat{Z}_1'\hat{Z}_1.$$

So 2SLS has an instrumental-variable interpretation. The variables in
$\hat{Z}_1$ are legitimate instruments because they are, at least asymptotically,
uncorrelated with the disturbance.
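
It may help to write the equivalence out line by line. The display below is a sketch that uses only facts already stated: $N = XA$ is symmetric and idempotent, $\hat{Z}_1 = NZ_1$, and the residual vector is computed from the original $Z_1$.

```latex
\begin{aligned}
\hat{Z}_1'\hat{u}_1
  &= \hat{Z}_1'\,(y_1 - Z_1\mathbf{a}_1^*) \\
  &= \hat{Z}_1'y_1 - \hat{Z}_1'Z_1\,\mathbf{a}_1^* \\
  &= \hat{Z}_1'y_1 - \hat{Z}_1'\hat{Z}_1\,\mathbf{a}_1^*
     \qquad\text{since } \hat{Z}_1'Z_1 = Z_1'N'NZ_1 = \hat{Z}_1'\hat{Z}_1 .
\end{aligned}
```

The last expression vanishes exactly when $\mathbf{a}_1^*$ solves the 2SLS normal equations, so the orthogonality conditions and the normal equations are the same set of equations.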

The 2SLS procedure may be applied to each of the structural equations
in turn. The fact that it relies on the unrestricted estimator $P$ as
the estimator of $\Pi$ suggests that when there are restrictions on $\Pi$, the
2SLS estimates will not be optimal.

Here are several remarks about the mechanical aspects of 2SLS esti- 
mation: 

* If a structural equation is not identified in terms of $\Pi$, then the
2SLS procedure will break down, as it should, for that equation. For
example, consider the supply equation in Model C. The second stage
of 2SLS calls for regressing $y_2$ on $(\hat{y}_1, x_1, x_2, x_3)$, but there is exact
multicollinearity among those four explanatory variables:

$$\hat{y}_1 = Xp_1 = x_1p_{11} + x_2p_{21} + x_3p_{31}.$$

So the solution to the second-stage normal equations is not unique, and
the 2SLS estimates are not defined.


* Standard errors for 2SLS cannot be smaller than the conventional
standard errors for LS obtained from

$$s_{11}(Z_1'Z_1)^{-1}, \quad\text{with}\quad s_{11} = (y_1 - Z_1\mathbf{a}_1)'(y_1 - Z_1\mathbf{a}_1)/(n - 3).$$

First, $\hat{\sigma}_{11} \geq s_{11}$ because LS minimizes the sum of squared residuals of
$y_1$ from a linear combination of the columns of $Z_1$. Second, $Z_1'Z_1 \geq
\hat{Z}_1'\hat{Z}_1$ because $\hat{Z}_1 = NZ_1$.

* One should not report $R^2$ for equations estimated by 2SLS. If one
uses the conventional sum of squared residuals, then one is measuring
the proportion of variation in the dependent variable that is accounted
for by the fitted explanatory variables. Alternatively, if one uses the sum
of squared residuals that enters the 2SLS estimate of $\sigma_{11}$, then there is
no guarantee that the resulting $R^2$ will lie between 0 and 1.
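
The breakdown described in the first remark can be seen in exact arithmetic. The sketch below is an illustration only: the tiny data set is invented, the helper names are hypothetical, and rational numbers are used so that singularity shows up as an exact zero rather than as rounding noise. It forms the second-stage regressors $(\hat{y}_1, x_1, x_2, x_3)$ for an equation that excludes no exogenous variable and checks that their cross-product matrix has zero determinant.

```python
from fractions import Fraction

def cross(A, B):
    """Return A'B for matrices stored as lists of rows."""
    return [[sum(a[i] * b[j] for a, b in zip(A, B))
             for j in range(len(B[0]))] for i in range(len(A[0]))]

def solve(M, R):
    """Solve M C = R exactly by Gauss-Jordan elimination (M assumed nonsingular)."""
    k = len(M)
    aug = [list(M[i]) + list(R[i]) for i in range(k)]
    for c in range(k):
        piv = next(r for r in range(c, k) if aug[r][c] != 0)
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]

def det(M):
    """Exact determinant by elimination; returns 0 if a pivot column is all zeros."""
    M = [list(row) for row in M]
    d = Fraction(1)
    for c in range(len(M)):
        piv = next((r for r in range(c, len(M)) if M[r][c] != 0), None)
        if piv is None:
            return Fraction(0)
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            d = -d
        d *= M[c][c]
        for r in range(c + 1, len(M)):
            f = M[r][c] / M[c][c]
            M[r] = [v - f * w for v, w in zip(M[r], M[c])]
    return d

# Invented data: five observations on three exogenous variables and on y1.
X = [[Fraction(v) for v in row]
     for row in [[1, 0, 2], [0, 1, 1], [1, 1, 0], [2, 1, 1], [1, 3, 2]]]
y1 = [[Fraction(v)] for v in [3, 1, 4, 1, 5]]

p1 = solve(cross(X, X), cross(X, y1))                  # first stage: p1 = A y1
y1_hat = [sum(x * p[0] for x, p in zip(row, p1)) for row in X]
Z2_hat = [[yh] + row for yh, row in zip(y1_hat, X)]    # (y1_hat, x1, x2, x3)

det_X = det(cross(X, X))                # nonzero: X has full column rank
det_second_stage = det(cross(Z2_hat, Z2_hat))
print(det_X, det_second_stage)
```

Because $\hat{y}_1$ is an exact linear combination of the columns of $X$, the second determinant is zero whatever data are used, which is the sense in which the second-stage normal equations have no unique solution.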


34.4. Relation between 2SLS and Indirect-FGLS 


Relying on $P$, the unrestricted estimator of $\Pi$, how does 2SLS succeed
in producing estimates of $\Gamma$ and $B$ even when there are restrictions on
$\Pi$? To explore that issue, we study the algebraic relation between the
2SLS and indirect-FGLS estimators. Consider first the demand equation
of Model B. We have

$$\hat{Z}_1'y_1 = Z_1'Ny_1 = Z_1'(XA)y_1 = Z_1'XAy_1 = Z_1'Xp_1,$$

$$\hat{Z}_1'\hat{Z}_1 = Z_1'NZ_1 = Z_1'(XA)Z_1 = Z_1'XAZ_1 = Z_1'X(p_2, D),$$

where $D$ consists of the first two columns of $AX = I$:

$$D = A(x_1, x_2) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix} = (d_1, d_2),$$

say. So the normal equations of 2SLS, namely

$$\hat{Z}_1'y_1 = \hat{Z}_1'\hat{Z}_1\mathbf{a}_1^*,$$

can be read as

$$Z_1'Xp_1 = Z_1'X(p_2, D)\begin{pmatrix} a_1^* \\ a_2^* \\ a_6^* \end{pmatrix}
          = Z_1'X(p_2a_1^* + d_1a_2^* + d_2a_6^*),$$

which may be rearranged into

$$Z_1'X(p_1 - p_2a_1^*) = Z_1'X(d_1a_2^* + d_2a_6^*)
                       = Z_1'X\begin{pmatrix} a_2^* \\ a_6^* \\ 0 \end{pmatrix}.$$

Here $Z_1'X$ is square and (coincidence apart) nonsingular, so the 2SLS
normal equations are equivalent to

(34.2)    $$p_1 - p_2a_1^* = \begin{pmatrix} a_2^* \\ a_6^* \\ 0 \end{pmatrix}.$$

Similarly, for the supply equation of Model B, the 2SLS normal equations
are equivalent to

(34.3)    $$p_2 - p_1a_3^* = \begin{pmatrix} 0 \\ a_4^* \\ a_5^* \end{pmatrix}.$$

Assembled together, Eqs. (34.2) and (34.3) say

$$P\hat{\Gamma} = \hat{B},$$

which is Eq. (34.1), the indirect-FGLS (and ILS) estimating equations
when there are no restrictions to be imposed. We conclude that for
Model B, which has no restriction on $\Pi$, 2SLS coincides with
indirect-FGLS.
When restrictions are present, this coincidence will not prevail. For
example, consider the demand equation in Model A. Suppose that we
tried to use the unrestricted $p$'s, rather than restricted estimates of the
$\pi$'s, in the ILS estimating equations, writing

$$p_1 - p_2\alpha_1 = d_1\alpha_2,$$

that is,

$$p_1 = (p_2, d_1)\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}.$$

This system of three equations in two unknowns is overdetermined: it
has no solution. We might combine the equations by premultiplying
through by any $2 \times 3$ matrix. One such matrix is $Z_1'X$ (here
$Z_1 = (y_2, x_1)$ is $n \times 2$). Premultiplying through by it gives

$$Z_1'Xp_1 = Z_1'X(p_2, d_1)\begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix},$$

which is a system of two equations in two unknowns. Now

$$Z_1'Xp_1 = \hat{Z}_1'y_1, \qquad Z_1'X(p_2, d_1) = \hat{Z}_1'\hat{Z}_1,$$

so we have arrived at the normal equations for 2SLS. From this
perspective, when restrictions are present, there is a surplus of estimating
equations. The 2SLS estimates can be viewed as the solution to a
collapsed set of those equations. It turns out that collapsing via $Z_1'X$ is
optimal: see Amemiya (1985, pp. 239-240).
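
A small numerical case may make the collapsing step concrete. The sketch below rests on invented inputs: the values of $p_1$ and $p_2$ are hypothetical, and $X'X$ is taken to be the identity purely to keep the arithmetic transparent. It shows that rows 2 and 3 of the overdetermined system imply conflicting values for the coefficient on $y_2$, and that premultiplying by $Z_1'X$ yields a solvable pair of equations.

```python
from fractions import Fraction as F

# Hypothetical inputs (not from the text): Q = X'X and the unrestricted
# reduced-form coefficient vectors p1, p2 for k = 3 exogenous variables.
Q = [[F(1), F(0), F(0)], [F(0), F(1), F(0)], [F(0), F(0), F(1)]]
p1 = [F(1), F(2), F(4)]
p2 = [F(1), F(1), F(1)]
d1 = [F(1), F(0), F(0)]   # first column of AX = I; x1 is the included exogenous variable

def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def times_Q(v):
    """Return the row vector v'Q."""
    return [dot(v, col) for col in zip(*Q)]

# Rows 2 and 3 of p1 = p2*alpha1 + d1*alpha2 each pin down alpha1 separately;
# when they disagree, the three equations have no common solution.
row_ratios = [p1[i] / p2[i] for i in (1, 2)]

# Z1'X has rows y2'X = p2'Q and x1'X = d1'Q (since X'y2 = Q p2 and X'x1 = Q d1).
M = [times_Q(p2), times_Q(d1)]                    # 2 x 3
lhs = [dot(row, p1) for row in M]                 # Z1'X p1
A = [[dot(row, p2), dot(row, d1)] for row in M]   # Z1'X (p2, d1)

# Solve the collapsed 2 x 2 system by Cramer's rule.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
alpha1 = (lhs[0] * A[1][1] - A[0][1] * lhs[1]) / det
alpha2 = (A[0][0] * lhs[1] - lhs[0] * A[1][0]) / det
print(row_ratios, alpha1, alpha2)
```

With these particular numbers the two row-by-row values are 2 and 4, and the collapsed system gives 3 for the coefficient on $y_2$ and -2 for the coefficient on $x_1$; other choices of $X'X$ would weight the conflicting equations differently.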


34.5. Three-Stage Least Squares 
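In 2SLS, we estimate each structural equation separately, acting as if
classical regression models applied to

$$y_1 = \hat{Z}_1\boldsymbol{\alpha}_1 + u_1, \qquad y_2 = \hat{Z}_2\boldsymbol{\alpha}_2 + u_2.$$

There would seem to be an advantage to estimating the pair of structural
equations jointly. If we stack into

$$y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \quad
\hat{Z} = \begin{pmatrix} \hat{Z}_1 & O \\ O & \hat{Z}_2 \end{pmatrix}, \quad
\boldsymbol{\alpha} = \begin{pmatrix} \boldsymbol{\alpha}_1 \\ \boldsymbol{\alpha}_2 \end{pmatrix}, \quad
u = \begin{pmatrix} u_1 \\ u_2 \end{pmatrix},$$

then $y = \hat{Z}\boldsymbol{\alpha} + u$ has the appearance of a SUR model. If $\Sigma^*$ (hence $\Sigma$)
were known, we might calculate an estimator by the GLS rule, namely

$$\mathbf{a} = (\hat{Z}'\Sigma^{-1}\hat{Z})^{-1}\hat{Z}'\Sigma^{-1}y.$$

Lacking that knowledge, we may adopt an FGLS rule. Estimate $\Sigma^*$ from
the proper residuals of 2SLS:

$$\hat{\Sigma}^* = \frac{1}{n}\begin{pmatrix} e_1^{*\prime}e_1^* & e_1^{*\prime}e_2^* \\ e_2^{*\prime}e_1^* & e_2^{*\prime}e_2^* \end{pmatrix},$$

and construct $\hat{\Sigma}$ from $\hat{\Sigma}^*$. Then calculate

$$\mathbf{a}^{**} = (\hat{Z}'\hat{\Sigma}^{-1}\hat{Z})^{-1}\hat{Z}'\hat{\Sigma}^{-1}y.$$

This defines the three-stage least squares, or 3SLS, estimator of the
structural parameters. The LS rule is used three times: first on the reduced
form (to get the $\hat{Z}$'s), next on individual structural equations (to get the
$e^*$'s), and finally on the structural equations jointly (to get the $\mathbf{a}^{**}$).
Clearly, 3SLS will break down (as it should) if the system contains a
nonidentified structural equation.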

It can be shown that the 3SLS estimator is consistent and BAN, like 
indirect-FGLS. The computations for 3SLS, like those for 2SLS, do not 
involve nonlinear optimization, even when restrictions are present. The 
inverse matrix in the formula above for a** serves as the estimate of 
the (asymptotic) variance matrix of a**. 
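
For computation it can be convenient to write the 3SLS formula in block form. The display below is a sketch under an assumption that is not spelled out above, namely that the stacked disturbance variance is built as $\Sigma = \Sigma^* \otimes I_n$; the symbols $\hat{\sigma}^{gh}$, introduced here for illustration, denote the elements of the inverse of the $2 \times 2$ matrix $\hat{\Sigma}^*$.

```latex
\hat{Z}'\hat{\Sigma}^{-1}\hat{Z} =
\begin{pmatrix}
\hat{\sigma}^{11}\,\hat{Z}_1'\hat{Z}_1 & \hat{\sigma}^{12}\,\hat{Z}_1'\hat{Z}_2 \\
\hat{\sigma}^{21}\,\hat{Z}_2'\hat{Z}_1 & \hat{\sigma}^{22}\,\hat{Z}_2'\hat{Z}_2
\end{pmatrix},
\qquad
\hat{Z}'\hat{\Sigma}^{-1}y =
\begin{pmatrix}
\hat{\sigma}^{11}\,\hat{Z}_1'y_1 + \hat{\sigma}^{12}\,\hat{Z}_1'y_2 \\
\hat{\sigma}^{21}\,\hat{Z}_2'y_1 + \hat{\sigma}^{22}\,\hat{Z}_2'y_2
\end{pmatrix}.
```

Under that assumption no $2n \times 2n$ matrix ever needs to be inverted. If $\hat{\sigma}^{12} = 0$, the two blocks decouple and the formula reduces to equation-by-equation 2SLS.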


34.6. Remarks 


We conclude with some remarks on estimation in the SEM. 

* If all the structural equations are identified, and there are no
restrictions on $\Pi$, then indirect least squares, indirect-FGLS, 2SLS, 3SLS, and
FIML all produce the same estimates.

* If a parameter is not identified, then there is no method to estimate 
it consistently. 

* In the SEM, there is in general no unbiased estimator of the struc- 
tural parameters. 

* Throughout this chapter, we have confined attention to an SEM in
which the only prior information consists of normalizations and
exclusions. If other information is available (for example, $\Sigma^*$ is diagonal, or
a coefficient in one structural equation is equal to a coefficient in
another), then some modifications are needed in the description of the
estimators and their statistical properties.

* We have relied on stratified (nonstochastic $X$) sampling in this
chapter. The statements about asymptotic properties rely on an
additional assumption about how additional observations are generated,
namely that the matrix $X'X/n$ has a positive definite limit: see Section
22.7. If instead sampling is random from the joint distribution of $(x', y')$,
no substantial change in the results would be required: see Chapter 25.


Exercises 
34.1 You are given the following sums of squares and cross-products
on the variables $y_1$ = quantity, $y_2$ = price, $x$ = income, obtained in a
sample of 60 observations:

           x     y1    y2
    x     360   120   120
    y1    120   110     5
    y2    120     5    80

You are told that the sample was produced by this simultaneous-equation
model:

    Demand:  $y_1 = \alpha_1 y_2 + \alpha_2 x + u_1$,
    Supply:  $y_2 = \alpha_3 y_1 + u_2$,

in which the exogenous variable $x$ was independent of the disturbances
$u_1$ and $u_2$, while those two disturbances had zero expectations and were
correlated with each other.

(a) From the sample data, calculate the LS "estimate" of $\alpha_3$, and the
2SLS estimate of $\alpha_3$.

(b) Would you use 2SLS, or some other method, to estimate $\alpha_1$ and
$\alpha_2$? Justify your answer.

(c) From the information in hand, you are asked to predict the value
of $y_2$ that will prevail when $y_1 = 55$. Would your prediction be
2.5, or 55, or some other number? Justify your answer.


34.2 The usual simultaneous-equation model applies to

    $y_1 = \alpha_1 y_2 + \alpha_2 x_1 + u_1$,
    $y_2 = \alpha_3 y_1 + \alpha_4 x_2 + u_2$.

Here $y_1$ = quantity, $y_2$ = price, $x_1$ = input price, and $x_2$ = income. These
two LS regressions were obtained in a sample of 100 observations:

    $\hat{y}_1 = -6x_1 + 2x_2$,
    $\hat{y}_2 = 3x_1 + x_2$.

Calculate estimates of the $\alpha$'s.


34.3 In Exercise 33.2, you considered this supply-demand model for
labor:

    $y_1 = \alpha_1 y_2 + \alpha_2 x_1 + \alpha_3 x_2 + u_1$,

    $y_2 = \alpha_4 y_1 + \alpha_5 x_1 + \alpha_6 x_3 + \alpha_7 x_4 + \alpha_8 x_5 + u_2$,

where $y_1$ = months worked, $y_2$ = wage rate, $x_1$ = 1, $x_2$ = family size,
$x_3$ = education, $x_4$ = age, $x_5$ = race (= 1 if black, = 0 if white), $u_1$ =
supply shock, $u_2$ = demand shock. Now suppose, rather artificially, that
this SEM applies to the SCF data set of Exercise 17.4. Take $y_2$ = wage
rate = earnings/months worked.

Write and run a program to: 


(a) Calculate the LS “estimates” of the structural coefficients, along 
with their conventional standard errors. 

(b) Calculate the 2SLS estimates of the structural coefficients, along 
with their standard errors. 

(c) Discuss your results from an economic perspective. 


34.4 For the setup of Exercise 34.3, write and run programs to: 


(a) Calculate the 3SLS estimates of the structural coefficients. 

(b) Calculate the indirect-FGLS estimates of the structural coeffi- 
cients. 

(c) Assuming normality, calculate the FIML estimates of the struc- 
tural coefficients, by iterating the indirect-FGLS algorithm until 
convergence. 

(d) Also comment on the relation among your alternative estimates. 


34.5 For the setup of Exercise 34.3: 


(a) Use your 2SLS estimates to derive an estimate of the reduced-form
coefficient matrix $\Pi$.

(b) Does this estimate of $\Pi$ satisfy the restrictions that you found in
Exercise 33.2? Explain.

(c) Compare your estimated $\Pi$ with the unrestricted estimate $P$.


Appendix A
Statistical and Data Tables 


Table A.1 Standard normal cumulative distribution function.


  z    0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09


0.00  0.500  0.504  0.508  0.512  0.516  0.520  0.524  0.528  0.532  0.536
0.10  0.540  0.544  0.548  0.552  0.556  0.560  0.564  0.567  0.571  0.575
0.20  0.579  0.583  0.587  0.591  0.595  0.599  0.603  0.606  0.610  0.614
0.30  0.618  0.622  0.626  0.629  0.633  0.637  0.641  0.644  0.648  0.652
0.40  0.655  0.659  0.663  0.666  0.670  0.674  0.677  0.681  0.684  0.688


0.50  0.691  0.695  0.698  0.702  0.705  0.709  0.712  0.716  0.719  0.722
0.60  0.726  0.729  0.732  0.736  0.739  0.742  0.745  0.749  0.752  0.755
0.70  0.758  0.761  0.764  0.767  0.770  0.773  0.776  0.779  0.782  0.785
0.80  0.788  0.791  0.794  0.797  0.800  0.802  0.805  0.808  0.811  0.813
0.90  0.816  0.819  0.821  0.824  0.826  0.829  0.831  0.834  0.836  0.839


1.00  0.841  0.844  0.846  0.848  0.851  0.853  0.855  0.858  0.860  0.862
1.10  0.864  0.867  0.869  0.871  0.873  0.875  0.877  0.879  0.881  0.883
1.20  0.885  0.887  0.889  0.891  0.893  0.894  0.896  0.898  0.900  0.901
1.30  0.903  0.905  0.907  0.908  0.910  0.911  0.913  0.915  0.916  0.918
1.40  0.919  0.921  0.922  0.924  0.925  0.926  0.928  0.929  0.931  0.932


1.50  0.933  0.934  0.936  0.937  0.938  0.939  0.941  0.942  0.943  0.944
1.60  0.945  0.946  0.947  0.948  0.949  0.951  0.952  0.953  0.954  0.954
1.70  0.955  0.956  0.957  0.958  0.959  0.960  0.961  0.962  0.962  0.963
1.80  0.964  0.965  0.966  0.966  0.967  0.968  0.969  0.969  0.970  0.971
1.90  0.971  0.972  0.973  0.973  0.974  0.974  0.975  0.976  0.976  0.977


2.00  0.977  0.978  0.978  0.979  0.979  0.980  0.980  0.981  0.981  0.982
2.10  0.982  0.983  0.983  0.983  0.984  0.984  0.985  0.985  0.985  0.986
2.20  0.986  0.986  0.987  0.987  0.987  0.988  0.988  0.988  0.989  0.989
2.30  0.989  0.990  0.990  0.990  0.990  0.991  0.991  0.991  0.991  0.992
2.40  0.992  0.992  0.992  0.992  0.993  0.993  0.993  0.993  0.993  0.994


2.50  0.994  0.994  0.994  0.994  0.994  0.995  0.995  0.995  0.995  0.995
2.60  0.995  0.995  0.996  0.996  0.996  0.996  0.996  0.996  0.996  0.996
2.70  0.997  0.997  0.997  0.997  0.997  0.997  0.997  0.997  0.997  0.997
2.80  0.997  0.998  0.998  0.998  0.998  0.998  0.998  0.998  0.998  0.998
2.90  0.998  0.998  0.998  0.998  0.998  0.998  0.998  0.999  0.999  0.999


3.00  0.999  0.999  0.999  0.999  0.999  0.999  0.999  0.999  0.999  0.999


Example: If $Z \sim N(0, 1)$, then $\Pr(Z \leq 1.15) = F(1.15) = 0.875$.
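
Entries of this table can be spot-checked numerically. The snippet below is a minimal sketch in Python rather than GAUSS; the function name is hypothetical, and it relies only on the standard identity linking the normal cdf to the error function.

```python
import math

def std_normal_cdf(z):
    """F(z) for Z ~ N(0, 1), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Reproduce the worked example: F(1.15) rounded to three decimals.
print(round(std_normal_cdf(1.15), 3))   # 0.875
```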
Table A.2 Chi-square cumulative distribution function.


                                      G_k(c)

  k    0.05   0.10   0.15   0.20   0.25   0.30   0.35   0.40   0.45   0.50   0.55

  1    0.00   0.02   0.04   0.06   0.10   0.15   0.21   0.27   0.36   0.45   0.57
  2    0.10   0.21   0.33   0.45   0.58   0.71   0.86   1.02   1.20   1.39   1.60
  3    0.35   0.58   0.80   1.01   1.21   1.42   1.64   1.87   2.11   2.37   2.64
  4    0.71   1.06   1.37   1.65   1.92   2.19   2.47   2.75   3.05   3.36   3.69
  5    1.15   1.61   1.99   2.34   2.67   3.00   3.33   3.66   4.00   4.35   4.73
  6    1.64   2.20   2.66   3.07   3.45   3.83   4.20   4.57   4.95   5.35   5.77
  7    2.17   2.83   3.36   3.82   4.25   4.67   5.08   5.49   5.91   6.35   6.80
  8    2.73   3.49   4.08   4.59   5.07   5.53   5.98   6.42   6.88   7.34   7.83
  9    3.33   4.17   4.82   5.38   5.90   6.39   6.88   7.36   7.84   8.34   8.86
 10    3.94   4.87   5.57   6.18   6.74   7.27   7.78   8.30   8.81   9.34   9.89
 11    4.57   5.58   6.34   6.99   7.58   8.15   8.70   9.24   9.78  10.34  10.92
 12    5.23   6.30   7.11   7.81   8.44   9.03   9.61  10.18  10.76  11.34  11.95
 13    5.89   7.04   7.90   8.63   9.30   9.93  10.53  11.13  11.73  12.34  12.97
 14    6.57   7.79   8.70   9.47  10.17  10.82  11.45  12.08  12.70  13.34  14.00
 15    7.26   8.55   9.50  10.31  11.04  11.72  12.38  13.03  13.68  14.34  15.02
 16    7.96   9.31  10.31  11.15  11.91  12.62  13.31  13.98  14.66  15.34  16.04
 17    8.67  10.09  11.12  12.00  12.79  13.53  14.24  14.94  15.63  16.34  17.06
 18    9.39  10.86  11.95  12.86  13.68  14.44  15.17  15.89  16.61  17.34  18.09
 19   10.12  11.65  12.77  13.72  14.56  15.35  16.11  16.85  17.59  18.34  19.11
 20   10.85  12.44  13.60  14.58  15.45  16.27  17.05  17.81  18.57  19.34  20.13
 25   14.61  16.47  17.82  18.94  19.94  20.87  21.75  22.62  23.47  24.34  25.22
 30   18.49  20.60  22.11  23.36  24.48  25.51  26.49  27.44  28.39  29.34  30.31
 35   22.47  24.80  26.46  27.84  29.05  30.18  31.25  32.28  33.31  34.34  35.39
 40   26.51  29.05  30.86  32.34  33.66  34.87  36.02  37.13  38.23  39.34  40.46
 45   30.61  33.35  35.29  36.88  38.29  39.58  40.81  42.00  43.16  44.34  45.53
 50   34.76  37.69  39.75  41.45  42.94  44.31  45.61  46.86  48.10  49.33  50.59
 55   38.96  42.06  44.24  46.04  47.61  49.06  50.42  51.74  53.04  54.33  55.65
 60   43.19  46.46  48.76  50.64  52.29  53.81  55.24  56.62  57.98  59.33  60.71
 65   47.45  50.88  53.29  55.26  56.99  58.57  60.07  61.51  62.92  64.33  65.77
 70   51.74  55.33  57.84  59.90  61.70  63.35  64.90  66.40  67.87  69.33  70.82
 75   56.05  59.79  62.41  64.55  66.42  68.13  69.74  71.29  72.81  74.33  75.88
 80   60.39  64.28  66.99  69.21  71.14  72.92  74.58  76.19  77.76  79.33  80.93
 85   64.75  68.78  71.59  73.88  75.88  77.71  79.43  81.09  82.71  84.33  85.98
 90   69.13  73.29  76.20  78.56  80.62  82.51  84.29  85.99  87.67  89.33  91.02
 95   73.52  77.82  80.81  83.25  85.38  87.32  89.14  90.90  92.62  94.33  96.07
100   77.93  82.36  85.44  87.95  90.13  92.13  94.00  95.81  97.57  99.33 101.11


Table A.2 (continued)


                                      G_k(c)

  k    0.60   0.65   0.70   0.75   0.80   0.85   0.90   0.95  0.975  0.990  0.995

  1    0.71   0.87   1.07   1.32   1.64   2.07   2.71   3.84   5.02   6.63   7.88
  2    1.83   2.10   2.41   2.77   3.22   3.79   4.61   5.99   7.38   9.21  10.60
  3    2.95   3.28   3.66   4.11   4.64   5.32   6.25   7.81   9.35  11.34  12.84
  4    4.04   4.44   4.88   5.39   5.99   6.74   7.78   9.49  11.14  13.28  14.86
  5    5.13   5.57   6.06   6.63   7.29   8.12   9.24  11.07  12.83  15.09  16.75
  6    6.21   6.69   7.23   7.84   8.56   9.45  10.64  12.59  14.45  16.81  18.55
  7    7.28   7.81   8.38   9.04   9.80  10.75  12.02  14.07  16.01  18.48  20.28
  8    8.35   8.91   9.52  10.22  11.03  12.03  13.36  15.51  17.53  20.09  21.95
  9    9.41  10.01  10.66  11.39  12.24  13.29  14.68  16.92  19.02  21.67  23.59
 10   10.47  11.10  11.78  12.55  13.44  14.53  15.99  18.31  20.48  23.21  25.19
 11   11.53  12.18  12.90  13.70  14.63  15.77  17.28  19.68  21.92  24.72  26.76
 12   12.58  13.27  14.01  14.85  15.81  16.99  18.55  21.03  23.34  26.22  28.30
 13   13.64  14.35  15.12  15.98  16.98  18.20  19.81  22.36  24.74  27.69  29.82
 14   14.69  15.42  16.22  17.12  18.15  19.41  21.06  23.68  26.12  29.14  31.32
 15   15.73  16.49  17.32  18.25  19.31  20.60  22.31  25.00  27.49  30.58  32.80
 16   16.78  17.56  18.42  19.37  20.47  21.79  23.54  26.30  28.85  32.00  34.27
 17   17.82  18.63  19.51  20.49  21.61  22.98  24.77  27.59  30.19  33.41  35.72
 18   18.87  19.70  20.60  21.60  22.76  24.16  25.99  28.87  31.53  34.81  37.16
 19   19.91  20.76  21.69  22.72  23.90  25.33  27.20  30.14  32.85  36.19  38.58
 20   20.95  21.83  22.77  23.83  25.04  26.50  28.41  31.41  34.17  37.57  40.00
 25   26.14  27.12  28.17  29.34  30.68  32.28  34.38  37.65  40.65  44.31  46.93
 30   31.32  32.38  33.53  34.80  36.25  37.99  40.26  43.77  46.98  50.89  53.67
 35   36.47  37.62  38.86  40.22  41.78  43.64  46.06  49.80  53.20  57.34  60.27
 40   41.62  42.85  44.16  45.62  47.27  49.24  51.81  55.76  59.34  63.69  66.77
 45   46.76  48.06  49.45  50.98  52.73  54.81  57.51  61.66  65.41  69.96  73.17
 50   51.89  53.26  54.72  56.33  58.16  60.35  63.17  67.50  71.42  76.15  79.49
 55   57.02  58.45  59.98  61.66  63.58  65.86  68.80  73.31  77.38  82.29  85.75
 60   62.13  63.63  65.23  66.98  68.97  71.34  74.40  79.08  83.30  88.38  91.95
 65   67.25  68.80  70.46  72.28  74.35  76.81  79.97  84.82  89.18  94.42  98.11
 70   72.36  73.97  75.69  77.58  79.71  82.26  85.53  90.53  95.02 100.43 104.21
 75   77.46  79.13  80.91  82.86  85.07  87.69  91.06  96.22 100.84 106.39 110.29
 80   82.57  84.28  86.12  88.13  90.41  93.11  96.58 101.88 106.63 112.33 116.32
 85   87.67  89.43  91.32  93.39  95.73  98.51 102.08 107.52 112.39 118.24 122.32
 90   92.76  94.58  96.52  98.65 101.05 103.90 107.57 113.15 118.14 124.12 128.30
 95   97.85  99.72 101.72 103.90 106.36 109.29 113.04 118.75 123.86 129.97 134.25
100  102.95 104.86 106.91 109.14 111.67 114.66 118.50 124.34 129.56 135.81 140.17


Example: If $W \sim \chi^2(6)$, then $\Pr(W \leq 4.20) = G_6(4.20) = 0.35$.
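
The chi-square entries can be checked in the same spirit. The function below is a sketch, not a library routine: its name is hypothetical, and it evaluates $G_k(c)$ through the standard power series for the regularized incomplete gamma function, which is adequate for the range of $k$ and $c$ in the table.

```python
import math

def chi2_cdf(c, k):
    """G_k(c): cdf of the chi-square distribution with k degrees of freedom."""
    if c <= 0:
        return 0.0
    a, x = k / 2.0, c / 2.0
    # Series: P(a, x) = x^a e^{-x} / Gamma(a + 1) * sum_n x^n / ((a+1)...(a+n))
    term, total, n = 1.0, 1.0, 0
    while term > 1e-15 * total:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))

# Reproduce the worked example: G_6(4.20) rounded to two decimals.
print(round(chi2_cdf(4.20, 6), 2))   # 0.35
```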
Table A.3 SCF data set.


V1 = ID number      V5 = Experience      V9 = Earnings
V2 = Family size    V6 = Months worked   V10 = Income
V3 = Education      V7 = Race            V11 = Wealth
V4 = Age            V8 = Region          V12 = Savings
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 4 2 40 33 12 2 3 1.920 1.920 0.470 0.030 
2 4 9 33 19 12 1 1 12.403 12.403 3.035 0.874 
3 2 17 31 9 12 1 4 5.926 6.396 2.200 0.370 
4 3 9 50 36 12 1 2 7.000 7.005 11.600 1.200 
5 4 12 28 11 12 1 3 6.990 6.990 0.300 0.275 
6 4 13 33 15 12 1 1 6.500 6.500 2.200 1.400 
7 5 17 36 14 12 1 3 26.000 26.007 11.991 31.599 
8 5 16 44 23 12 1 1 15.000 15.363 17.341 1.766 
9 5 9 48 34 12 2 3 5.699 14.999 9.852 3.984 
10 5 16 31 10 12 1 3 8.820 9.185 8.722 1.017 
11 10 9 41 27 12 1 4 7.000 10.600 0.616 1.004 
12 4 10 41 26 12 1 1 6.176 12.089 23.418 0.687 
13 7 11 36 20 12 1 2 6.200 6.254 7.600 -0.034
14 5 14 31 12 12 1 3 5.800 9.010 0.358 -1.389
15 5 7 27 15 12 1 2 6.217 6.217 0.108 1.000 
16 5 8 42 29 12 1 2 5.500 5.912 5.560 1.831 
17 4 12 28 11 11 1 1 4.800 4.800 0.970 0.613 
18 2 6 46 35 12 2 3 1.820 2.340 2.600 0.050 
19 3 12 47 30 12 1 4 4.558 7.832 31.867 0.013 
20 7 8 35 22 12 1 2 7.468 9.563 1.704 1.389 
21 3 9 41 27 9 1 1 6.600 7.600 4.820 0.602 
22 4 17 30 8 12 1 1 12.850 13.858 32.807 2.221 
23 6 12 38 21 12 1 1 5.800 5.802 10.305 1.588 
24 3 11 48 32 12 1 3 7.479 19.362 12.652 5.082 
25 3 10 36 21 12 1 1 5.700 8.000 7.631 1.846 
26 3 12 45 28 12 1 1 12.000 17.200 14.392 0.914 
27 6 8 44 31 6 1 1 3.578 4.091 6.649 2.483 
28 4 10 44 29 12 1 3 9.600 9.600 6.995 0.837 
29 3 3 46 38 12 1 3 3.686 10.425 9.138 1.274 
30 4 12 26 9 12 1 3 6.480 6.512 2.933 -0.275
31 5 12 50 33 12 1 4 6.383 7.675 38.260 1.092 
32 4 8 46 33 11 1 1 5.610 12.418 12.661 1.157 
33 5 8 33 20 12 1 1 6.000 6.079 0.820 0.340 
34 4 12 41 24 12 1 2 6.300 6.979 21.286 0.373 
35 5 17 33 11 12 1 1 10.513 10.517 9.723 3.307 
36 4 12 41 24 12 1 2 30.000 30.996 95.187 10.668 
37 3 12 29 12 11 2 1 3.427 5.283 0.171 1.105 
38 9 11 27 11 12 1 2 8.500 8.511 3.105 3.500 
39 5 12 42 25 12 1 1 11.300 12.700 7.385 0.541
40 5 16 39 18 12 1 3 16.960 16.770 16.049 3.020 
Note: V3, V4, V5 are in years; for V7, 1 = white, 2 = black; for V8, 1 = northeast, 2 =
northcentral, 3 = south, 4 = west; V9, V10, V11, V12 are in thousands of current dollars.


Table A.3 (continued)

[Observations 41-80: only the V10 column survives; the other columns are not recoverable.]

 V1      V10
 41-50   8.800  5.975  6.265  8.520 24.226  0.750  7.356  9.000 14.660  5.593
 51-60  11.841  7.700 10.550 13.700 12.242  7.803  9.879  9.154  7.067  4.496
 61-70   4.636  9.003 13.820  8.891  8.632  8.385  5.403  8.573  6.516  6.000
 71-80  16.778  9.504  8.953  8.703 12.667  6.504  8.180 11.600  5.602 10.390
Table A.3 (continued)


V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
81 4 16 44 23 12 1 2 27.000 30.610 51.892 4.115
82 4 9 34 20 9 1 1 1.500 3.941 1.260 2.575
83 7 10 39 24 12 1 3 1.789 2.936 17.128 -0.119
84 5 12 39 22 12 1 4 11.068 11.068 11.542 -5.577
85 4 14 29 10 12 1 4 8.338 8.338 2.272 2.750
86 3 8 38 25 12 1 3 2.943 6.683 6.100 0.095
87 5 10 30 15 12 1 1 7.212 7.212 0.857 1.348
88 3 10 50 35 12 1 1 7.500 10.411 3.678 0.178
89 2 8 33 20 12 1 3 5.250 8.850 1.650 -0.695
90 4 9 35 21 12 1 1 5.066 8.334 2.143 0.787
91 3 16 36 15 12 1 2 12.848 13.923 18.182 4.642
92 4 12 33 16 12 1 2 6.214 6.214 0.275 1.260
93 6 20 38 13 12 1 1 12.202 12.323 28.953 2.687
94 4 12 46 29 12 1 2 8.190 14.963 11.230 0.720
95 4 16 50 29 12 1 2 7.200 10.060 25.462 5.109
96 2 16 54 33 12 1 1 30.000 32.080 98.033 1.800
97 5 12 31 14 12 1 2 9.190 9.260 5.539 1.684
98 2 18 27 4 12 1 2 7.500 10.450 2.860 1.475
99 5 12 40 23 12 1 3 7.852 9.138 11.197 0.566
100 6 18 34 11 12 1 1 12.000 12.350 30.906 25.405


Source: T. W. Mirer, Economic Statistics and Econometrics, 2d ed. (New York: Macmillan, 1988), 
pp. 18-23. 
Table A.4 Noncentral chi-square: complement
of cumulative distribution function.


$\lambda^2$    k = 1    k = 2    k = 3
0.000    0.050    0.050    0.050
0.500    0.109    0.090    0.081
1.000    0.170    0.133    0.116
1.500    0.232    0.178    0.153
2.000    0.293    0.226    0.192
2.500    0.353    0.274    0.233
3.000    0.410    0.322    0.275
3.500    0.465    0.369    0.317
4.000    0.516    0.415    0.359
4.500    0.564    0.460    0.400
5.000    0.609    0.504    0.440


Note: The entries are the values of $1 - G_k^*(c_k; \lambda^2)$,
where $G_k^*(\cdot\,; \lambda^2)$ is the cdf of the noncentral chi-square
distribution with degrees of freedom parameter $k$ and
noncentrality parameter $\lambda^2$, and $c_k$ satisfies $G_k^*(c_k; 0) =
0.95$. Thus $c_1 = 3.841$, $c_2 = 5.991$, $c_3 = 7.815$. The
table was constructed by using the GAUSS command
"cdfchinc(c,k,m)", which gives $G_k^*(c; m^2)$.
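
For readers without GAUSS, the entries can be approximated from first principles. The sketch below does not reproduce cdfchinc; it uses the standard representation of the noncentral chi-square cdf as a Poisson mixture of central chi-square cdfs, with Poisson mean equal to half the noncentrality parameter. The function names and the truncation at 100 terms are choices made here for illustration.

```python
import math

def chi2_cdf(c, k):
    """Central chi-square cdf via the incomplete-gamma power series."""
    if c <= 0:
        return 0.0
    a, x = k / 2.0, c / 2.0
    term, total, n = 1.0, 1.0, 0
    while term > 1e-15 * total:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))

def noncentral_tail(c, k, lam2, terms=100):
    """1 - G*_k(c; lam2): Poisson(lam2 / 2) mixture of chi-squares with k + 2j df."""
    weight = math.exp(-lam2 / 2.0)       # Poisson probability of j = 0
    cdf = 0.0
    for j in range(terms):
        cdf += weight * chi2_cdf(c, k + 2 * j)
        weight *= (lam2 / 2.0) / (j + 1)  # Poisson probability of j + 1
    return 1.0 - cdf

# Compare with the table row for noncentrality 0.500.
print(round(noncentral_tail(3.841, 1, 0.5), 3), round(noncentral_tail(5.991, 2, 0.5), 3))
```

The two printed values should agree with the tabulated 0.109 and 0.090 up to rounding in the last digit.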
Table A.5 TIM data set.


V1 = ID number                      V7 = Real disposable personal income
V2 = Year - 1900                    V8 = Change in GNP price index
V3 = GNP price index                V9 = Change in consumer price index
V4 = Real GNP                       V10 = Unemployment rate
V5 = Real gross private domestic    V11 = Money stock (M1)
     investment                     V12 = Treasury bill rate
V6 = Real personal consumption      V13 = Corporate bond rate
                                          (Moody's Aaa)

Note: V3 is equal to 100 in 1972; V4, V5, V6, V7 are in billions of 1972 dollars; V11 is in
billions of current dollars; V8, V9, V12, V13 are in percent per year; V10 is in percent.


V1 V2     V3     V4    V5    V6     V7    V8     V9  V10   V11    V12   V13


 1 56  62.79  671.6 102.6 405.4  446.2 3.205  1.496  4.1 135.0  2.658  3.36
 2 57  64.93  683.8  97.0 413.8  455.5 3.408  3.563  4.3 133.8  3.267  3.89
 3 58  66.04  680.9  87.5 418.0  460.7 1.710  2.728  6.8 138.9  1.839  3.79
 4 59  67.60  721.7 108.0 440.4  479.7 2.362  0.808  5.5 141.2  3.405  4.38
 5 60  68.70  737.2 104.7 452.0  489.7 1.627  1.604  5.5 142.2  2.928  4.41


 6 61  69.33  756.6 103.9 461.4  503.8 0.917  1.015  6.7 146.7  2.378  4.35
 7 62  70.61  800.3 117.6 482.0  524.9 1.846  1.116  5.5 149.4  2.778  4.33
 8 63  71.67  832.5 125.1 500.5  542.3 1.501  1.214  5.7 154.9  3.157  4.26
 9 64  72.77  876.4 133.0 528.0  580.8 1.535  1.309  5.2 162.0  3.549  4.40
10 65  74.36  929.3 151.9 557.5  616.3 2.185  1.722  4.5 169.6  3.954  4.49


11 66  76.76  984.8 163.0 585.7  646.8 3.228  2.857  3.8 173.8  4.881  5.13
12 67  79.06 1011.4 154.9 602.7  673.5 2.996  2.881  3.8 185.2  4.321  5.51
13 68  82.54 1058.1 161.6 634.4  701.3 4.402  4.200  3.6 199.5  5.339  6.18
14 69  86.79 1087.6 171.4 657.9  722.5 5.149  5.374  3.5 205.9  6.677  7.03
15 70  91.45 1085.6 158.5 672.1  751.6 5.369  5.920  4.9 216.8  6.458  8.04


16 71  96.01 1122.4 173.9 696.8  779.2 4.986  4.299  5.9 231.0  4.348  7.39
17 72 100.00 1185.9 195.0 737.1  810.3 4.156  3.298  5.6 252.4  4.071  7.21
18 73 105.69 1255.0 217.5 768.5  865.3 5.690  6.225  4.9 266.4  7.041  7.44
19 74 114.92 1248.0 195.5 768.6  858.4 8.733 10.969  5.6 278.0  7.886  8.57
20 75 125.56 1233.9 154.8 780.2  875.8 9.259  9.140  8.5 291.8  5.838  8.83


21 76 132.11 1300.4 184.5 823.7  907.4 5.217  5.769  7.7 311.1  4.989  8.43
22 77 139.83 1371.7 218.5 863.9  939.8 5.844  6.452  7.1 336.4  5.265  8.02
23 78 150.05 1436.9 229.7 904.8  981.5 7.309  7.658  6.1 364.9  7.221  8.73
24 79 162.77 1483.0 232.6 930.9 1011.5 8.477 11.259  5.8 390.5 10.041  9.63
25 80 177.86 1480.7 203.6 935.1 1018.4 8.964 13.523  7.1 415.6 11.506 11.94


Source: T. W. Mirer, Economic Statistics and Econometrics, 2d ed. (New York: Macmillan, 1988), 
pp. 24—25. 
Table A.6 GEWE data set (V2-V7 in millions of 1947 dollars).


V1 = ID number                    V5 = WE investment
V2 = GE investment                V6 = WE market value
V3 = GE market value              V7 = WE lagged capital stock
V4 = GE lagged capital stock


V1     V2      V3     V4    V5      V6   V7

 1   33.1  1170.6   97.8  12.9   191.5
 2   45.0  2015.8  104.4  25.9   516.0
 3   77.2  2803.3  118.0  35.1   729.0
 4   44.6  2039.7  156.2  22.9   560.4
 5   48.1  2256.2  172.6  18.8   519.9
 6   74.4  2132.2  186.6  28.6   628.5
 7  113.0  1834.1  220.9  48.5   537.1
 8   91.9  1588.0  287.8  43.3   561.2
 9   61.3  1749.4  319.9  37.0   617.2
10   56.8  1687.2  321.3  37.8   626.7

11   93.6  2007.7  319.6  39.3   737.2
12  159.9  2208.3  346.0  53.5   760.5
13  147.2  1656.7  456.4  55.6   581.4
14  146.3  1604.4  543.4  49.6   662.3
15   98.3  1431.8  618.3  32.0   583.8
16   93.5  1610.5  647.4  32.2   635.2
17  135.2  1819.4  671.3  54.4   723.8
18  157.3  2079.7  726.1  71.8   864.1
19  179.5  2371.6  800.3  90.1  1193.5
20  189.6  2759.9  888.9  68.6  1188.9

[The V7 entries are not recoverable.]


Source: H. Theil, Principles of Econometrics (New York: John Wiley & Sons, 1971),
p. 296.


Appendix B 
Getting Started in GAUSS 


These notes, adapted from material provided by Aptech Systems Inc., 
provide some important information about getting started using 
GAUSS. They refer to Version 1.49B. They do not in any way provide 
complete documentation, even for the topics covered. 


Notation 


>    Denotes the DOS prompt.

( )  Denotes a key on the keyboard. For example, (F2) denotes
Function Key 2, while ( - ) denotes two keys pressed
simultaneously, for example, (Ctrl-F1).

»    Denotes the GAUSS prompt.

«    Denotes the GAUSS program terminator.


Editing and Running Programs 


1. To get into GAUSS from the operating system: > gauss (Enter) 
There may be a special command provided in your system. 

2. To get out of GAUSS into the operating system: (Esc) 

3. GAUSS has two modes of operation: COMMAND MODE and 
EDIT MODE. When you first get into GAUSS you are in COM- 
MAND MODE, as indicated by FILE=COMMAND at the bottom 
of the screen. 

4. In COMMAND MODE you can write and run interactive programs.
After the GAUSS prompt », start writing GAUSS statements.
End them with semicolons. After the last statement in the
program, press (F4), and then (F2). For example:


» x1=rndu(100,3); x2=rndu(100,1); x=x1~x2; x; (F4) (F2)


creates two random matrices and then concatenates them horizontally.
The result, x, is a 100 × 4 matrix.


. GAUSS does not care about blank spaces (with only a few excep- 


tions), or about blank lines in programs. It does not distinguish 
uppercase and lowercase letters. 


. To get into EDIT MODE, use the command "edit" followed by 


the name of the file you want to edit. For example: 


>> edit myprog; (F4) (F2) 


To get out of EDIT MODE and back into COMMAND MODE, 
use a function key: 


(Fl) SAVE: saves the file you are editing, without running it. 

(F2) RUN: saves the file you are editing, and will try to run it. 

(F4) QUIT: drops you back to COMMAND MODE without 
saving file but clearing screen. 


. After a program file has been run from EDIT MODE, and you 


are back in COMMAND MODE, you can return to EDIT MODE 
to re-edit that file by pressing (CTRL-F1). 


. Programs written in EDIT MODE are just like those in COM- 


MAND MODE, except that they do not contain the GAUSS 
prompt Ж and program terminator <. To run a program from 
EDIT MODE, press (F2). Programs in EDIT MODE are auto- 
matically saved in a file when they are run. 


9. To print output on the screen, write the name of the matrix. To
print matrix x, for example, write: > x;

10. The format of the output can be changed by using format w,p;
here w is the width of the field for each number, and p is the
number of decimal places. For example: > format 8,2;

11. To send output to a file, use a command such as:

> output file = myoutput.out reset;

After this command is executed, anything printed to the screen
will also be sent to the file myoutput.out (which is first cleared).
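
As a minimal sketch of how these statements combine, the short program below directs output to a file, sets a format, and prints a matrix. It is written as it would appear in EDIT MODE, so it has no prompt. The matrix m is illustrative, and the closing statement output off; (which stops sending output to the file) is an addition not covered in the text above.

```gauss
/* Illustrative only: m is a made-up matrix */
output file = myoutput.out reset;   /* open myoutput.out, clearing it first */
format 8,2;                         /* field width 8, 2 decimal places */
m = rndn(3,2);
m;                                  /* printed on the screen and sent to the file */
output off;                         /* stop sending output to the file */
```

Because the file is opened with reset, running the program a second time overwrites the earlier contents of myoutput.out.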


12. Enclose comments with a combination of slashes and asterisks:
/* */. For example:

/*This is a comment; it will not be executed in a program.*/

13. To use a DOS command, precede it with the word "dos". For
example:

> dos dir; > dos del myfile; > dos copy a:myfile c:;

14. Mathematical operators: 


+ - * /  Perform in the standard way on scalars (add, subtract,
         multiply, divide). On matrices, +, -, and * have the
         standard definitions.

.* ./    Perform element-by-element multiplication and division,
         respectively, of matrices.

^        Performs element-by-element exponentiation (raising to
         a power) of the elements of a matrix.
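
The difference between the matrix product and the element-by-element product is easy to overlook. The sketch below, written as an EDIT MODE program, applies the operators to two small matrices; the names a and b and their values are made up for illustration.

```gauss
/* a and b are illustrative 2 X 2 matrices, filled row by row */
let a[2,2] = 1 2 3 4;
let b[2,2] = 5 6 7 8;
a*b;      /* matrix product: rows 19 22 and 43 50 */
a.*b;     /* element-by-element product: rows 5 12 and 21 32 */
a./b;     /* element-by-element quotient */
a^2;      /* each element squared: rows 1 4 and 9 16 */
```

Note that a^2 squares each element; it is not the matrix product a*a.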


15. Matrix operators: 


~ Concatenates matrices horizontally.
| Concatenates matrices vertically.
' Transposes a matrix (interchanges rows and columns).


16. Mathematical functions: 


cols(x) Gives the number of columns in matrix x. 
exp(x) Raises e to powers given by elements of x. 
ln(x) Natural logs (base e) of elements of x.
log(x) Common logs (base 10) of elements of x. 
meanc(x) Means of the columns of x. 

rows(x) The number of rows in matrix x. 

sqrt(x) Square roots of elements in x. 

sumc(x) Sums of the columns of x. 
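
A brief sketch of several of these functions working together follows. The matrix x is made up for illustration, and the program is written as it would appear in EDIT MODE.

```gauss
/* x is an illustrative 3 X 2 matrix with rows (1 2), (3 4), (5 6) */
let x[3,2] = 1 2 3 4 5 6;
rows(x); cols(x);     /* 3 and 2 */
sumc(x);              /* column sums, as a 2 X 1 vector: 9 and 12 */
meanc(x);             /* column means, as a 2 X 1 vector: 3 and 4 */
sumc(x)/rows(x);      /* the same means, computed by hand */
exp(ln(x));           /* recovers x, up to rounding */
```

Both sumc and meanc return their results as column vectors, which is why the exercises below transpose them to print a row.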


17. Defining matrices: 


eye(k) k X k identity matrix. 
let Allows matrices to be defined explicitly: 
let x[2,2] = 1 8 -12 15; [a 2 X 2 matrix] 
let x = 1 5 7 9; [a 4 X 1 vector]
let x = dog cat; [matrix with character elements]
ones(n,k) n X k matrix of 1's.
rndn(n,k) n X k matrix of standard normal random variables.
zeros(n,k) n X k matrix of 0's.


18. Indexing elements of a matrix:

x[i,j] The i,j element of x.

x[.,j] jth column of x.

x[i,.] ith row of x.

x[m:n,j] Rows m through n of column j of x.

x[rv,cv] Rows of x specified in the vector rv, columns of x
specified in the vector cv.
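
The indexing forms can be tried on a small matrix. In the sketch below the matrix x and the index vectors rv and cv are illustrative values chosen for this example; the program is written as it would appear in EDIT MODE.

```gauss
/* x is an illustrative 3 X 3 matrix with rows (1 2 3), (4 5 6), (7 8 9) */
let x[3,3] = 1 2 3 4 5 6 7 8 9;
x[2,3];       /* element in row 2, column 3: 6 */
x[.,2];       /* second column: 2, 5, 8 */
x[1,.];       /* first row: 1 2 3 */
x[2:3,1];     /* rows 2 through 3 of column 1: 4, 7 */
let rv = 1 3;
let cv = 2 3;
x[rv,cv];     /* rows 1 and 3, columns 2 and 3: rows 2 3 and 8 9 */
```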
19. Loading and saving matrices: 


save x; Save matrix x as file named x.fmt. 
load x; Load matrix x from file named x.fmt. 


20. Flow control: 


do until ...; ...; endo; Do loop. For example:
i=1; do until i > 10; i=i+1; endo;
if ...; ...; endif; If statement. For example:

if age[i,1] < 10; dage[i,1]=1;

elseif age[i,1] >= 10 and age[i,1] <= 20;
dage[i,1]=2;

else; dage[i,1]=3; endif;
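
The if statement above refers to a row index i, so in practice it sits inside a loop. A minimal sketch of a complete program follows. The data in age are made up, n is a helper name introduced here, and the second branch is written with >= 10 on the assumption that an age of exactly 10 belongs in the middle group.

```gauss
/* age is an illustrative 4 X 1 vector; dage will hold the group codes */
let age = 5 15 20 40;
n = rows(age);
dage = zeros(n,1);
i = 1;
do until i > n;
    if age[i,1] < 10; dage[i,1] = 1;
    elseif age[i,1] >= 10 and age[i,1] <= 20; dage[i,1] = 2;
    else; dage[i,1] = 3;
    endif;
    i = i + 1;
endo;
dage';    /* printed as a row: 1 2 2 3 */
```

The counter i must be incremented inside the loop; otherwise the condition i > n is never met and the loop does not end.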


Simple Exercises 
To do these exercises, first get into GAUSS. After GAUSS is loaded 
into memory, the GAUSS prompt will appear on the screen, preceded 
by the letter of the default disk drive. Type the exercises below, exactly 
as they are written. The exercises are all indented and begin with the 
GAUSS prompt >. They all end with (F4) (F2), the keystrokes that tell 
GAUSS that you are done with the program, and that you want it run. 
When you press (F4) (F2), the GAUSS program terminator symbol < 
appears on the screen. 

The exercises are sequential, in that each uses results in memory that
have been created by the preceding ones.


1. Generate a matrix of random numbers, x, and print it out: 
> x=rndn(2,2); x; (F4) (F2) 
2. Generate another matrix, y, multiply x and y, and print the result: 


> y=rndu(2,2); y; z=x*y; z; (F4) (F2)




3. Define a 2 X 2 matrix using a specified set of numbers:


> let w[2,2]=1 2 3 4; w; (F4) (F2) 


4. Sum the elements of each column of w, and find the means of
each column:

> sumc(w); meanc(w); (F4) (F2)


5. Recover the screen as it was before the last exercise by pressing
(F1). Edit the code by placing a transpose operator symbol before
each semicolon:

> sumc(w)'; meanc(w)'; (F4) (F2)


6. Sum the elements of each column of w, assign the result to sw,
and print:

> sw=sumc(w); sw; (F4) (F2)


7. Define a 4 X 1 vector using a specified set of numbers. The new
vector will be given the name w, so that the old w vanishes:


> let w=1 2 3 4; w; (F4) (F2) 


8. Generate two matrices with specified elements, and sum them:

> let x[2,2]=10 9 8 7; let y[2,2]=-1 -2 -3 -4; x; y;
w=x+y; w; (F4) (F2)


9. Multiply each element in x by the corresponding element in y;
divide each element in y by the corresponding element in x:

> x .* y; y ./ x; (F4) (F2)


10. Pull out the first row of x, and assign it to w; then pull out the
first column of x, and assign it to z; then print both w and z:

> w=x[1,.]; z=x[.,1]; w; z; (F4) (F2)

11. Concatenate x and y. First do it horizontally, then vertically:

> z1 = x~y; z2 = x|y; z1; z2; (F4) (F2)


12. Generate and print a 2 X 2 identity matrix, a 2 X 2 matrix of
1's, and a 2 X 2 matrix of 0's:

> k=2; i=eye(k); u=ones(k,k); z=zeros(k,k); i; u; z; (F4) (F2)




13. Save a matrix to a file:

> save u; (F4) (F2)

14. Look for the file u.fmt on the default drive:

> dos dir u.fmt; (F4) (F2)


15. Set the matrix u equal to scalar 0, print it, then load u from
the saved file and print it again:

> u=0; u; load u; u; (F4) (F2)

16. Find out how many rows and columns a matrix has, and print:

> format 1,0; "The matrix x has " rows(x) " rows, and "
cols(x) " columns."; (F4) (F2)


17. Using a loop, generate a sequence of numbers, and print the
numbers divided by 10:

> i=0; format 2,1;
do until i>33; print i/10;; i=i+1; endo; (F4) (F2)


References 


Amemiya, T. 1985. Advanced Econometrics. Cambridge, Mass.: Harvard
University Press.

Conlisk, J. 1971. “When collinearity is desirable.” Western Economic Journal
9:393-407.

DeGroot, M. H. 1975. Probability and Statistics. Reading, Mass.:
Addison-Wesley.

Frisch, R., and F. V. Waugh. 1933. “Partial time regressions as compared
with individual trends.” Econometrica 1:387-401.

Goldberger, A. S. 1964. Econometric Theory. New York: John Wiley & Sons.

Gouriéroux, C., A. Holly, and A. Monfort. 1982. “Likelihood ratio test,
Wald test, and Kuhn-Tucker test in linear models with inequality
constraints on the regression parameters.” Econometrica 50:63-80.

Greene, W. H. 1990. Econometric Analysis. New York: Macmillan.

Intriligator, M. D. 1978. Econometric Models, Techniques, and Applications.
Englewood Cliffs, N.J.: Prentice-Hall.

Johnston, J. J. 1984. Econometric Methods. 3d ed. New York: McGraw-Hill.

Judge, G. G., R. C. Hill, W. E. Griffiths, H. Lütkepohl, and T.-C. Lee. 1988.
Introduction to the Theory and Practice of Econometrics. 2d ed. New York:
John Wiley & Sons.

Kakwani, N. C. 1967. “The unbiasedness of Zellner's seemingly unrelated
regression equations estimators.” Journal of the American Statistical
Association 62:141-142.

Kosobud, R., and J. N. Morgan, eds. 1964. Consumer Behavior of Individual
Families over Two and Three Years. Ann Arbor: Institute for Social
Research, The University of Michigan.

Leamer, E. E. 1983. “Let's take the con out of econometrics.” American
Economic Review 73:31-43.

Lovell, M. 1983. “Data mining.” Review of Economics and Statistics 65:1-12.

McCloskey, D. N. 1985. “The loss function has been mislaid: the rhetoric
of significance tests.” American Economic Review 75:201-205.

Maddala, G. S. 1983. Limited-Dependent and Qualitative Variables in
Econometrics. London: Cambridge University Press.

Manski, C. F. 1988. Analog Estimation Methods in Econometrics. New York:
Chapman and Hall.

Marschak, J. 1953. “Economic measurements for policy and prediction,”
pp. 1-26 in W. C. Hood and T. C. Koopmans, eds., Studies in Econometric
Method. New York: John Wiley & Sons.

Mirer, T. W. 1988. Economic Statistics and Econometrics. 2d ed. New York:
Macmillan.

Rao, C. R. 1973. Linear Statistical Inference and Its Applications. 2d ed. New
York: John Wiley & Sons.

Theil, H. 1971. Principles of Econometrics. New York: John Wiley & Sons.

Wallace, T. D., and V. G. Ashar. 1972. “Sequential methods in model
construction.” Review of Economics and Statistics 54:172-178.

Wolak, F. A. 1987. “An exact test for multiple inequality and equality
constraints in the linear regression model.” Journal of the American
Statistical Association 82:782-793.

Zellner, A. 1962. “An efficient method of estimating seemingly unrelated
regressions and tests for aggregation bias.” Journal of the American
Statistical Association 57:348-368.


Index 


adjusted 
coefficient of determination, 178 
mean squared residual, 167 
sample variance, 120 
Aitken's Theorem, 294 
alternative hypothesis, 214 
Amemiya, T., 206, 243, 299, 300, 374 
analog estimator, 117 
analogy principle, 117 
analysis 
of sums of squares, 176 
of variance, 48 
of variation, 176 
approximate 
confidence interval, 123 
standard error, 123 
approximation to CEF, 53, 151 
AR (autoregressive processes), 282—284 
ARMA (autoregressive-moving average 
process), 284 
Ashar, V. G., 260 
asymptotic 
criteria, 121 
distribution, 99 
efficiency, 122 
expectation, 100 
properties, 94 
standard error, 123 
variance, 100 
asymptotics with nonstochastic X, 242— 
243 
autocorrelated variable, 277 
autocorrelation and autocovariance, 
278 


autoregressive case of GCR model, 302 

autoregressive-moving average process 
(ARMA), 284 

autoregressive processes (AR), 282-284 

auxiliary regression, 184 


BAN (best asymptotically normal), 122 
Bernoulli distribution, 12 
best asymptotically normal (BAN), 122 
best linear predictor (BLP), 52, 151 
best proportional predictor (BPP), 57 
bias, 118 
binary response, 144, 309 
binomial distribution, 13 
bivariate 
Central Limit Theorem, 109 
Delta Method, 110 
Law of Large Numbers, 109 
normal distribution, 74 
probability distribution, 34 
BLP (best linear predictor), 52, 151 
BPP (best proportional predictor), 57 
BVN (bivariate normal), 74 


C1, C2 (convergence theorems), 98

Cauchy-Schwarz Inequality, 66 

causality, 173, 340, 346 

cdf (cumulative distribution function), 
15, 36 

CEF (conditional expectation function), 
49, 150 

censored dependent variable, 310 

Central Limit Theorem (CLT), 99, 109 

central moment, 27, 44 




changing expectation, 277 
characteristic roots and vectors, 200 
Chebyshev Inequalities, 31 
chi-square distribution, 83, 87, 382—383 
Chow test, 237 
classical normal regression model 
(CNR), 204 
classical regression model (CR), 163 
CLT (Central Limit Theorem), 99, 109 
cmf (conditional mean function), 6 
CNR (classical normal regression), 204 
Cobb-Douglas production function, 233 
coefficient of determination, 66, 177 
coefficient vector, 154 
collinearity, 208, 245 
concave and convex functions, 32 
concentrated log-likelihood function, 
334 
conditional 
expectation, 46—47 
expectation function (CEF), 49, 150 
frequency distribution, 3 
mean, 5 
mean function (cmf), 6 
median function, 56 
probability distribution, 38—40 
variance function (CVF), 49 
confidence interval, 123 
confidence region, 208 
Conlisk, J., 251 
consistent estimator, 121 
consumption function, 234 
continuous 
probability distribution, 14, 35 
uniform distribution, 16 
convergence, 97—98 
correlation coefficient, 45 
correlation ratio, 66 
covariance, 45 
covariance of linear functions, 46 
covariance matrix, 161 
CR (classical regression), 163 
Cramér-Rao Inequality (CRI), 129 
critical value, 214 
cumulative distribution function (cdf), 
15, 36 
curved-roof distribution, 41 
CVF (conditional variance function), 49 




D1-D5, D6-D10 (distribution results),
206-207, 223-225
degenerate distribution, 69, 77, 98, 198 
degrees of freedom, 87 
DeGroot, M. H., 99 
Delta Method, 102, 110 
demand function, 233 
deterministic relation, 1 
deviations from means, 188 
discrete 
probability distribution, 11, 34 
uniform distribution, 13 
distributions: 
Bernoulli, 12 
binomial, 13 
bivariate normal (BVN), 74 
chi-square, 83, 87, 382-383 
continuous uniform, 16 
curved-roof, 41 
discrete uniform, 13 
exponential, 16 
F, 199 
multinormal (multivariate normal), 
195—196 
non-BVN, 77 
noncentral chi-square, 219, 387 
normal, 23, 68—69, 195-196 
Poisson, 13-14 
power, 18 
rectangular, 16 
roof, 36 
Snedecor F, 199 
standard bivariate normal (SBVN), 
70 
standard logistic, 18 
standard normal, 16, 381 
Student’s t, 88 
three-point, 63 
trinomial, 35 
univariate normal, 68—69 
distribution of function, 20—23 
disturbance vector, 170 
double residual regression, 186 
Durbin-Watson statistic, 305 


economic significance, 240 
eigenvalues and eigenvectors, 200 
empirical relation, 2 




endogenous variable, 340 

equicorrelated process, 287-288 

estimate and estimator, 116 

exact collinearity, 245 

exclusions, 361 

exogenous variable, 340 

expectation, 26, 160 

expectation of function, 26, 28, 44, 45, 
161 

expected value, 26 

explicit selection, 145 

exponential distribution, 16 


F distribution, 199 

F1—F4 (features of normal sampling), 
91 

F1*—F4* (features of standard normal 
sampling), 90 

FGLS (feasible generalized least 
squares), 297 

FIML (full-information maximum like- 
lihood), 368 

first-order autoregressive case of GCR 
model, 302 

first-order processes (AR(1), MA(1)), 
282—283 

fitted-value vector, 154 

fixed explanatory variables, 147 

FOC (first-order condition), 135 

frequency distribution, 3 

Friedman's hypothesis, 338 

Frisch, R., 186 

full-information maximum likelihood 
(FIML), 368 

full-rank case, 154 

fully recursive model, 362 

function, 1 


GAUSS, 180 
Gauss-Markov Theorem, 165 
GCR (generalized classical regression), 
292 
general linear hypothesis, 233 
generalized 
classical normal regression model, 
298 
classical regression model (GCR), 292 




least squares (GLS), 294 

neoclassical regression model, 298 
GEWE data set, 335, 389 
GLS (generalized least squares), 294 
Goldberger, A. S., 287 
goodness of fit, 176 
Gouriéroux, C., 238 
Greene, W. H., 243, 288, 297, 300, 314,
318


heteroskedasticity, 300 

heteroskedasticity-corrected standard 
errors, 272 

homoskedasticity, 141 

hypothesis test, 214 


I1-I3 (independence theorems), 59-60
ideal 
sample covariance, 107 
sample slope, 111 
sample variance, 85 
idempotent matrix, 155 
identical explanatory variables, 327 
identically distributed variables, 60, 81, 
106 
identification, 355, 356, 362 
ILS (indirect least squares), 342 
importance, 240—241 
indefinite matrix, 158 
independence, 58—59, 60, 81, 106 
indirect feasible generalized least 
squares (indirect-FGLS), 366—368 
indirect least squares (ILS), 342 
information rule and variable, 131 
instrumental-variable analogy, 139 
instrumental-variable estimator (IV), 
143 
Intriligator, M. D., 230 
invariance property, 136 
iterative FGLS (iterative feasible gener- 
alized least squares), 334 
IV (instrumental variable), 143 
IZEF (iterative Zellner efficient esti- 
mator), 334 


Jensen's Inequality, 32 
Johnston, J. J., 171, 172, 246-247 




joint 
confidence region, 209 
cumulative distribution function, 36 
frequency distribution, 3 
moments, 44 
null hypothesis, 216 
probability distribution, 34-35
joint-conditional distribution, 362 
joint-marginal distribution, 150 
Judge, G. G., 170, 171, 172, 206, 243, 
245, 248, 260, 285, 297, 300, 305, 
314, 318, 337 


Kakwani, N. C., 331 
Keynesian model, 340 
Kosobud, R., 2 
Kronecker product, 325 


Law of Iterated Expectations, 47 
Law of Large Numbers (LLN), 99, 109 
Leamer, E. E., 261 
least-squares (LS) 
analogy, 139 
estimator, 165 
linear regression, 152 
property, 114 
likelihood function, 134 
limiting distribution, 98 
linear 
approximation to CEF, 53, 151 
CEF, 54, 171 
function of normal variables, 69, 76— 
77, 198 
function rules, 28, 45, 46 
projection (LP), 52, 151 
regression, 152 
relation, 54, 65 
LLN (Law of Large Numbers), 99, 109 
logistic model, 310 
log-likelihood variable, 128 
long regression, 184 
Lovell, M., 262 
LP (linear projection), 52 
LS (least squares), 152 


M1-M4 (mean-independence theo- 
rems), 62—64 




MA (moving average processes), 282, 
283 
McCloskey, D. N., 240 
Maddala, G. S., 319 
Manski, C. F., 117, 313 
marginal 
expectation, 48 
frequency distribution, 3 
mean, 5 
probability distribution, 37—38 
significance level, 239 
Markov Inequality, 31 
Marschak, J., 343 
mass points, 11, 34 
maximum likelihood (ML), 134 
mean, 5 
mean-independence, 61 
mean squared error (MSE), 29, 118 
mean squared error matrix, 256 
median, 30 
method of moments, 117 
micronumerosity, 249 
minimum-distance procedure, 368 
minimum variance linear unbiased esti- 
mator (MVLUE), 120, 165—166 
minimum variance unbiased estimator 
(MVUE), 119 
Mirer, T. W., 192, 290, 386, 388 
miss vector, 218 
mixed probability distribution, 19, 40 
ML (maximum likelihood), 134 
models: 
A, B, C, 357-360 
classical normal regression (CNR), 
204 
classical regression (CR), 163 
fully recursive, 362 
generalized classical normal regression,
298
generalized classical regression 
(GCR), 292 
generalized neoclassical regression, 
298 
Keynesian, 340 
logistic, 310 
multivariate regression, 323 
neoclassical normal regression 
(NeoCNR), 269 




neoclassical regression (NeoCR), 264 
permanent income, 338 
probit, 144, 309, 317 
regression-system, 323 
seemingly unrelated regressions 
(SUR), 323 
simultaneous-equation (SEM), 351 
stationary population (SP), 278-279 
supply-demand, 349, 357-360 
Tobit, 310 
moments, 27, 44 
Morgan, J. N., 2 
moving average processes (MA), 282, 283 
MSE (mean squared error), 118 
multicollinearity, 245 
multinormal (multivariate normal) dis- 
tribution, 196 
multiple regression, 150 
multivariate population, 150 
multivariate regression model, 323 
MVLUE (minimum variance linear 
unbiased estimator), 120, 165—166 
MVUE (minimum variance unbiased 
estimator), 119 


NeoCNR (neoclassical normal regres- 
sion), 269 

NeoCR (neoclassical regression), 264 

NLLS (nonlinear least squares), 143 

non-BVN distribution, 77 

noncentral chi-square distribution, 219, 
387 

noncentrality parameter, 219 

nonlinear CEF, 142, 308 

nonlinear least squares (NLLS), 143 

nonnegative definite matrix, 158 

nonstationary process, 288 

nonstochastic explanatory variables, 
147, 164 

normal distribution, 23, 68—69, 195— 
196 

normal equations, 154 

normalizations, 361 

null hypothesis, 214, 216 


omitted variables, 190 
one-sided alternative and one-tailed 
test, 237 




orthogonal explanatory variables, 185, 
190, 327 

orthogonality analogy, 139 

orthonormal matrix, 200 


P1—P5 (properties of normal distribu- 
tions), 76—77, 197—198 
pdf (probability density function), 14, 
35 
permanent income model, 338 
plim (probability limit), 98 
pmf (probability mass function), 12, 34 
Poisson distribution, 13—14 
population, 6 
population moments, 284 
population regression function, 49, 150 
positive definite matrix, 158 
power distribution, 18 
power of test, 218 
prediction, 30, 51-52, 151, 175 
pretest estimation, 258 
probability 
axioms, 11 
density function (pdf), 14, 35 
limit (plim), 97—98 
mass function (pmf), 12, 34 
probit model, 144, 309, 317 
process parameters, 284 
pure heteroskedasticity case of GCR 
model, 300 
P-value, 239 


Q1-Q4 (quadratic form theorems),
200-202, 219
quadratic form, 163 


R1-R6 (rules on matrix expectations), 
161-163 
random 
sample, 60, 80, 106, 171 
variable, 11 
vector, 34 
walk, 288 
Rao, C. R., 102 
ratio of sample means, 110 
raw moment, 27, 44 
rectangular distribution, 16 
reduced form, 340, 349 




reduced-form disturbance, 352 
regression fishing, 261 
regression F-statistic, 230 
regression strategy, 258 
regression-system model, 323 
residual regression, 185 
residual vector, 155 
restrictions, 236, 331, 355 
reverse CEF, 146 

roof distribution, 36 

R² (coefficient of determination), 177


S1-S5 (Slutsky theorems), 102
sample, 6 
autocorrelation and autocovariance, 
285 
covariance, 106—107 
covariance vector, 271 
linear projection, 111 
maximum, 82 
mean, 82 
moments about population mean, 85
moments about sample mean, 82 
proportion, 82 
raw moments, 82 
slope, 111 
space, 11 
statistic, 82 
t-ratio, 101 
variance, 82 
variance matrix, 271 
Sample Mean Theorem, 84 
sampling distribution, 82 
savings rate-income data set, 2 
SBVN (standard bivariate normal), 70 
SCF data set, 192-193, 384—386 
score variable, 128 
seasonal adjustment, 187 
second-order processes (AR(2), MA(2)), 
283—284 
seemingly unrelated regressions model
(SUR), 323
selection bias, 147 
selective sampling, 145 
SEM (simultaneous-equation model), 
351 
short-rank case, 154 
short regression, 184 




significance level, 214 
significant difference, 215 
simultaneity, 337—338 
simultaneous-equation model (SEM), 
351 
Slutsky theorems, 101—102 
Snedecor F distribution, 199 
SP (stationary population), 278—279 
stacking, 324—325 
standard 
bivariate normal distribution (SBVN),
70
deviation, 45 
error, 123 
error of forecast, 176 
logistic distribution, 18
normal distribution, 16, 381
normal vector, 199 
standardized sample mean, 94 
stationarity, 279 
stationary population model (SP), 278— 
279 
stochastic independence, 58-59 
stochastic process, 281 
stratified sampling, 147, 172 
structural 
change, 237 
disturbance, 351 
equations, 351 
form, 349 
parameters, 339 
Structure vs. regression, 343 
Student's t distribution, 88
Submatrix of Inverse Theorem, 191— 
192 
summer vector, 178 
supply-demand models, 349, 357—360 
SUR (seemingly unrelated regressions), 
323 
systematic part, 341 


T1-T4 (theorems on expectations), 28- 
29 

T5-T12 (theorems on expectations in
bivariate distribution), 45-49

T13-T14 (theorems on conditional 
expectation function), 53-54 

t-distribution, 88 


t-ratio, 101
t-statistic, 124
test statistic, 214
Theil, H., 330, 335, 389
three-point distribution, 63
3SLS (three-stage least squares), 374
TIM data set, 290, 388
time series, 274
Tobit model, 310
trend removal, 186
trinomial distribution, 35
2SLS (two-stage least squares), 343, 369


unbiased estimator, 118
unbiased predictor, 30
uncorrelated disturbances, 327, 361
uncorrelatedness, 63
univariate normal distribution, 68-69


variance, 27
variance of linear function, 28, 45, 161
variance matrix, 160
variance-independence, 141
variation, 176
varying marginals, 147


Wallace, T. D., 260
Waugh, F. V., 186
Wolak, F. A., 238


ZEF (Zellner efficient), 329
Zellner, A., 329
zero null subvector hypothesis, 228
ZES rule (zero expected score), 128
ZES-rule estimator, 132


ARTHUR S. GOLDBERGER IS VILAS RESEARCH
PROFESSOR OF ECONOMICS AT THE UNIVERSITY OF
WISCONSIN, MADISON, AND IS A MEMBER OF THE
NATIONAL ACADEMY OF SCIENCES.


ISBN 0-674-17544-1


